Evi Zouganeli Anis Yazidi Gustavo Mello Pedro Lind (Eds.)

Communications in Computer and Information Science 1650

# **Nordic Artificial Intelligence Research and Development**

4th Symposium of the Norwegian AI Society, NAIS 2022 Oslo, Norway, May 31 – June 1, 2022 Revised Selected Papers

# **Communications in Computer and Information Science 1650**

Editorial Board Members

Joaquim Filipe *Polytechnic Institute of Setúbal, Setúbal, Portugal*

Ashish Ghosh *Indian Statistical Institute, Kolkata, India*

Raquel Oliveira Prates *Federal University of Minas Gerais (UFMG), Belo Horizonte, Brazil*

Lizhu Zhou *Tsinghua University, Beijing, China*

More information about this series at https://link.springer.com/bookseries/7899

*Editors* Evi Zouganeli Department of Mechanical, Electronics, and Chemical Engineering Oslo Metropolitan University Oslo, Norway

Gustavo Mello Department of Computer Science Oslo Metropolitan University Oslo, Norway

Anis Yazidi Department of Computer Science Oslo Metropolitan University Oslo, Norway

Pedro Lind Department of Computer Science Oslo Metropolitan University Oslo, Norway

ISSN 1865-0929 ISSN 1865-0937 (electronic) Communications in Computer and Information Science ISBN 978-3-031-17029-4 ISBN 978-3-031-17030-0 (eBook) https://doi.org/10.1007/978-3-031-17030-0

© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

## **Preface**

This volume contains the papers presented during the 2022 Symposium of the Norwegian AI Society (NAIS 2022) that was held at Oslo Metropolitan University (OsloMet), in Oslo, during May 31 – June 1, 2022, and organized jointly by OsloMet and Simula Metropolitan (SimulaMet). The NAIS Symposium was held for the fourth time, and the second time since 2010. The previous symposium was held in Trondheim in 2019 as the COVID-19 pandemic forced us to cancel in 2020 and 2021. The symposium aims at bringing together researchers and practitioners in the field of artificial intelligence (AI) from Norway and Scandinavia to present ongoing work and discuss the future of AI. With the symposium, NAIS provides a forum for networking among researchers as well as building links with related research fields, practitioners, businesses, and the public sector.

This year there were 17 submissions. Each submission was reviewed by at least two Program Committee members as well as one of the symposium co-chairs. The quality of the submissions was very high, and 11 papers were finally accepted for publication, which were presented in four technical sessions – two on Tuesday, May 31, and two on Wednesday, June 1. The program included four invited keynote talks and a commercial pitch and panel session. In addition, three tutorials were offered before the start and after the end of the main event.

The symposium started at noon on Tuesday with a welcome address by symposium co-chair Evi Zouganeli. The first part of the symposium was dedicated to applied AI and robotics for real-life applications. The first keynote was given by Filippo Sanfilippo, University of Agder, who discussed different technologies, types of robots, and their applications including in Industry 5.0, wearable robotics, intelligent health, human-robot interaction and collaboration, and search-and-rescue scenarios. This keynote was the perfect introduction to the following pitch and panel discussion on "AI- and robotics-enabled systems – status, barriers, and timeline for deployment in real-life systems". The session was introduced by Evi Zouganeli and included three pitch presentations. Nils Jacob Berland, CEO, presented the autonomous power-line inspection solution by Bergen Robotics AS. He discussed the capabilities of the drone-based system, and the hurdles encountered on the way to development and real deployment. Audun Sanderud, CEO, presented the social robotics solution by Hiro Futures AS. He discussed the rationale and the technology behind robots that use human-like body language to facilitate communication and interaction. Asgeir Berland, Lead Data Scientist, presented the warehouse logistics and groceries distribution solution by Oda AS. He discussed how AI powers efficient resource management, route optimization, and distribution of fresh goods in Norway.

Thereafter, the three commercial representatives, together with Filippo Sanfilippo, had a panel discussion that was moderated by Trym Lindell, OsloMet. Amongst other things, the discussion revolved around the transformative power of AI for our society, and the potential for significant new value creation. What decides the timeline for deployment – is it management, market acceptance, or technical maturity? The panelists discussed having underestimated various technical challenges that may appear trivial but can delay or even hamper real deployments. Aspects that are trivial for humans are extremely challenging in artificial systems, where reliable and safe operation is required in the vicinity of humans. Explainability, trustworthiness, safety, and regulation were mentioned as important stepping stones. In addition, the discussion touched upon the effect of AI uptake on the job market – our panelists seemed to agree that robots will be assigned the boring repetitive tasks and allow the humans to shine. Overall, the panel shed new and insightful light on important topics around the roadmap towards real-life deployment.

Two technical sessions followed, one on Robotics and Intelligent Systems that was chaired by Kai Olav Ellefsen, University of Oslo, and one on AI in Cyber and Digital Sphere that was chaired by Lothar Fritch, OsloMet. The first day was rounded off by the second keynote on "AI Research and Europe's Upcoming AI Law" by Tobias Mahler, University of Oslo. The keynote discussed the legal regulation of AI that is under development in Europe, the Artificial Intelligence Act (AIA), shedding light on whether the law, if adopted, will facilitate the creation of trustworthy AI in Europe, or whether it might limit Europe's ability to develop competitive AI systems. The keynote led to an engaging discussion in plenum, where among other things, what is defined as AI in the regulation as well as security and safety aspects seemed to interest the audience. Afterwards, the Norwegian AI Society had a short General Assembly meeting, and then the participants walked down to the Norwegian Opera and Ballet and enjoyed a good dinner at Brasserie Sanguine.

Day two started with an inspiring keynote by Kjersti Aas, Big Insight Centre for Research-based Innovation, about ongoing work at Big Insight. Examples included the application of AI and machine learning in credit scoring, anomaly detection, and detection of money laundering as well as work on explainable AI. The talk was followed by a technical session on AI in Biological Applications and Medicine, chaired by Michael Riegler, SimulaMet. The first paper of this session received the Best Paper award based on the review evaluation score. After a hearty coffee break, the event resumed with a technical session on New AI Methods, chaired by Pedro Lind, OsloMet. The event was rounded off by a final keynote by Robert Jenssen, The Arctic University of Norway. The talk presented work from the Visual Intelligence Centre for Research-based Innovation, focusing on the development of new methods for learning from limited data, e.g. semi-supervised learning, few-shot learning, and self-supervised learning, and on explainable AI – for applications ranging from fish detection to medical imaging. The main event was concluded by a short thank-you talk by co-chairs Anis Yazidi and Evi Zouganeli, OsloMet. After that, the participants could mingle over a networking buffet lunch.

Three tutorials were offered. The first one took place prior to the main event on the morning of the first day; it was entitled "Search Algorithms in AI with Python" and delivered by Rashmi Gupta and Morten Goodwin, University of Agder. The other two tutorials took place in parallel after the end of the main event, on the afternoon of day two. The second tutorial was entitled "Goal! A practical guide to soccer video understanding" and presented by A. Cioppa, S. Giancola, A. Deliege, F. Magera, V. Somers, Le Kang, Xin Zhou, B. Ghanem, and M. Van Droogenbroeck representing Soccer Net. The third tutorial was entitled "The past, present, and future of XAI", and delivered by Kristoffer Wickstrøm, The Arctic University of Norway.

We are grateful to the Norwegian Research Council for funding the event, which did not require a fee this year. The success of the symposium would not be possible without the help of many colleagues. We would like to thank the Technical Program Committee for reviewing the papers and giving feedback to the authors. The Organizing Committee from OsloMet and SimulaMet acted, in effect, as the event Program Committee, and we would like to thank all colleagues for their commitment. We are grateful to the Artificial Intelligence Lab, the Department of Computer Science, and the Department of Mechanical, Electronics, and Chemical Engineering at OsloMet for supporting the event. We also thank the Course and Conference Centre (KK-senter) and Technical Support at OsloMet for their valuable assistance.

Last but not least, we would like to thank all participants at the symposium, including authors, speakers, keynote speakers, panelists, and session chairs – for presenting their work, engaging in discussions, actively participating in a lively exchange, and supporting the AI community.

June 2022 Evi Zouganeli Anis Yazidi Gustavo Mello Pedro Lind

## **Organization**

#### **General Co-chairs**


#### **Organizing Committee**


#### **Technical Program Committee**


Rabindra Khadka *Ymail, Norway*

Youcef Djenour *SINTEF, Norway*

Pedro Lind *Oslo Metropolitan University, Norway*

Andrea T. Marheim Storås *Simula Metropolitan, Norway*

Akriti Sharma *Oslo Metropolitan University, Norway*

Filippo Sanfilippo *University of Agder, Norway*

Hårek Haugerud *Oslo Metropolitan University, Norway*

Arvind Keprate *Oslo Metropolitan University, Norway*

Ismail Hassan *Oslo Metropolitan University, Norway*

Leonardo Rydin *Oslo Metropolitan University, Norway*

Michael Tarlton *Oslo Metropolitan University, Norway*

Marija Slavkovik *University of Bergen, Norway*

# **Robotics and Intelligent Systems**

# Knowledge Infused Representations Through Combination of Expert Knowledge and Original Input

Daniel Biermann, Morten Goodwin, and Ole-Christoffer Granmo

Centre for Artificial Intelligence Research (CAIR), Department of ICT, University of Agder, Grimstad, Norway daniel.biermann@uia.no

Abstract. Sophisticated applications in natural language processing, such as conversational agents, often need to be able to generalize across a range of different tasks to generate natural-feeling language. In this paper, we introduce a model that aims to improve generalizability with regard to different tasks by combining the original input with the output of a task-specific expert. Through a combination mechanism, we create a new representation that has been enriched with the information given by the expert. These enriched representations then serve as input to a downstream model. We test three different combination mechanisms in two combination paradigms and evaluate the performance of the new enriched representation in a simple encoder-decoder model. We show that even very simple combination mechanisms are able to significantly improve performance of the downstream model. This means that the encoded expert information is transported through the new enriched input representation, leading to a beneficial impact on performance within the task domain. This opens the way for exciting future endeavors such as testing performance on different task domains and the combination of multiple experts.

Keywords: Artificial neural networks · Natural language processing · Knowledge representation · Knowledge transfer

## 1 Introduction

In the field of natural language processing (NLP), conversational agents or chatbots are of ongoing interest. Challenges like the Amazon Alexa prize challenge<sup>1</sup> further incentivise research on chatbots in open-domain settings such as day-to-day conversation. A significant challenge in open-domain settings is the wide field of tasks these conversational agents encounter. For example, in a day-to-day conversation, a chatbot might need to simultaneously generate grammatically correct sentences while identifying different types of sentences (dialogue act classification), recognizing intent (intent classification) and answering questions (question answering).

<sup>1</sup> https://developer.amazon.com/alexaprize.

© The Author(s) 2022

E. Zouganeli et al. (Eds.): NAIS 2022, CCIS 1650, pp. 3–15, 2022. https://doi.org/10.1007/978-3-031-17030-0\_1

Transfer learning is the field of using the knowledge of an intelligent agent trained in one task for another task. It is of natural interest to the field of NLP as all tasks share the underlying concept of language. This mainly shows in the practice of pre-training models on large text corpora to generate contextualized word representations, e.g. ELMo [12]. Since the inception of the Transformer model [18], the Transformer's efficiency has prompted a trend in research to improve performance by pre-training Transformer-based models of rapidly increasing size on vast sets of unlabeled data and fine-tuning them for a specific task. Prominent examples are the GPT architectures [1,13,14] as well as BERT architectures (e.g. [3,9,16]) and XLNet [20]. The problem with these architectures is the massive cost of pre-training. The costs have already reached a level at which only corporations like Google, Facebook, etc. can afford to train these large models from the ground up.

Besides the pretraining-finetuning approaches, Mixture-of-Experts (MoE) and other ensemble methods are of particular interest for transfer learning. The idea behind ensemble models is to combine an ensemble of distinct experts in such a way that the different experts offset each other's weaknesses and elevate the overall architecture to a better and more robust performance, possibly across different tasks.

In this paper we propose a new, ensemble-based architecture that combines task-specific expert output with the initial input representation to form a new expert-information-enriched representation that serves as input for a downstream task model. That is, we combine the output of an expert solving a specific task with the original input word embeddings. In contrast to other ensemble models, our model utilizes an already trained expert whose output shape differs significantly from the original input shape. Furthermore, we explore different combination methods in our proposed architecture that are based on attention and RNNs. Additionally, we explore these methods in a dimensional and a sequential combination paradigm.

## 2 Related Work

The idea of combining separate experts has been explored since the 1990s [7,8]. Early renditions of MoE models used a gating function to decide which expert output is further propagated. Recent MoE research pushed the concept of sparsely-activated models such as the Switch-Transformers [5], enabling efficient models with trillions of parameters. MoE models mainly aim at creating sparse models where each incoming example is processed by different parameters, thus possibly training different parameter sets for different tasks. This is in contrast to dense networks in which the parameters are shared for each input. Our approach differs from these MoE models in that the experts are already trained and can have different architectures and output shapes. In MoE models, the experts often have the same architecture and output shape and have to be trained.

Using ensemble models to create new word embeddings has been the subject of previous research. [10] combined different word embeddings by ordinary least squares regression and by solving the orthogonal Procrustes problem, while [21] creates word meta-embeddings by combining different word embeddings via different ensemble methods. Recently, [4] employed an attention network to combine semantic lexical information of knowledge graphs and pre-trained word embeddings in an ensemble method. The method proposed in our work differs from these previous approaches. The biggest difference is that the mentioned works aimed at creating general word embeddings instead of task-specific embeddings. By task-specific embeddings we mean a vector representation that is infused with the output of an expert solving a specific task. Thus, the representations generated in this work are created with specific tasks in mind. Creating task-specific embeddings allows for a more flexible use of the architecture as we can tailor the experts that we choose to combine to the downstream task. Additionally, we use Transformer-based attention mechanisms to combine the original input with the expert output. Rather than creating new general word embeddings, we infuse the original word embedding with focused task-specific information in the form of the output of task-specific experts.

## 3 Methods

#### 3.1 Model

Fundamentally, our architecture resembles a classic encoder-decoder model. The encoder consists of the pre-trained expert and the combination mechanism, and generates the new enriched word-knowledge representation. The decoder consists of a downstream task model that is to be trained to perform its downstream task.

In the encoder, we present the input embedding to the expert, which subsequently calculates its output. The original input embedding and expert output are then concatenated and passed to the combination mechanism. The combination mechanism calculates the expert-knowledge-enriched representation, which has the same dimensionality as the original input embedding. The idea of enforcing the same dimensionality is to further support the modular structure of the architecture. This way, the expert combination process can be easily inserted between the original word embedding and the downstream model without having to change the downstream model. The enriched representation is then used as input for the decoder. The general structure is outlined in Fig. 1.

In general, the expert and downstream model can be arbitrary models of arbitrary tasks, with the experts already trained. The expert is regarded as a finished model and is not trained in our architecture. The idea is to be able to make use of old, already trained models and available pre-trained models to improve the performance of the downstream model either in the same or a different task.

In this paper, we explore the simplest case of combining one expert that has the same task domain as the downstream model. We choose the Context-Aware Self Attention dialogue act classifier model (CASA) [15] as an expert. Compared to the original CASA model, we only use pre-trained GloVe vectors [11] as word embeddings for the expert and replace the CRF classifier with a softmax classifier with 1000 hidden units. We test different combination methods and paradigms that are described in more detail below.

Fig. 1. Model architecture. Experts are pre-trained task-specific models. Downstream models are arbitrary, to-be-trained models. The combination mechanism combines the expert output and original input into a new enriched representation.

The downstream model consists of a single GRU (one-directional) layer [2] followed by a softmax classifier with 64 hidden units. We train the downstream model on the same task and dataset as the CASA expert.

When training the downstream model on the same task and data as the expert, we technically do not perform transfer learning as the task domains are the same. Nevertheless, by using a sophisticated, well-performing expert and a worse-performing, simple classifier we can test whether the task-knowledge infused in the enriched knowledge representation translates to a better performance in a simple model.

Fig. 2. Illustration of the dimensional and sequential combination paradigms.

#### 3.2 Combination Paradigms

In our architecture we explore two different combination paradigms: Dimensional and sequential. These paradigms are illustrated in Fig. 2.

*Dimensional Paradigm.* In the dimensional paradigm, the expert output, whose dimension equals the number of classes, is concatenated with the input embedding of each token in the input sequence, leading to the dimensionality *d<sub>emb</sub>* + *d<sub>class</sub>*. This concatenated vector is then presented to the combination mechanism as its input representation.

*Sequential Paradigm.* In the sequential paradigm, the expert output is appended to the list of tokens in the input sequence. For that, the output of the expert of dimension *d<sub>class</sub>* is projected to the embedding dimension *d<sub>emb</sub>* using a simple fully connected feedforward layer and added to the sequence. A sequence of length *N* becomes a sequence of length *N* + 1.

Thus, the combination mechanisms are presented with the challenge of reducing the dimensionality in the dimensional paradigm and reducing the sequence length in the sequential paradigm.
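The shape bookkeeping of the two paradigms can be illustrated with a minimal sketch in plain Python. The toy sizes, the helper names `dimensional_combine_input` and `sequential_combine_input`, and the fixed placeholder projection matrix are illustrative assumptions and are not taken from the paper's implementation; in the model, the projection is a trained feedforward layer.

```python
# Toy sizes for readability (assumption); the paper uses d_emb = 300 and 41 classes.
D_EMB, D_CLASS = 4, 3

def dimensional_combine_input(tokens, expert_out):
    """Dimensional paradigm: concatenate the expert output to every token embedding."""
    return [tok + expert_out for tok in tokens]  # each row has d_emb + d_class entries

def sequential_combine_input(tokens, expert_out, proj):
    """Sequential paradigm: project the expert output to d_emb and append it as a token."""
    projected = [sum(w * x for w, x in zip(row, expert_out)) for row in proj]
    return tokens + [projected]  # sequence length grows from N to N + 1

# Hypothetical inputs: N = 5 token embeddings and one expert output vector.
tokens = [[0.1 * (i + j) for j in range(D_EMB)] for i in range(5)]
expert_out = [0.7, 0.2, 0.1]
# Placeholder projection weights of shape (d_emb, d_class); trained in the real model.
proj = [[1.0 if j == i % D_CLASS else 0.0 for j in range(D_CLASS)] for i in range(D_EMB)]

dim_in = dimensional_combine_input(tokens, expert_out)
seq_in = sequential_combine_input(tokens, expert_out, proj)
print(len(dim_in), len(dim_in[0]))  # 5 7
print(len(seq_in), len(seq_in[0]))  # 6 4
```

In the first case the combination mechanism has to map 7 features per token back to 4; in the second it has to map 6 sequence positions back to 5.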

#### 3.3 Combination Mechanisms

We test our model with three different combination mechanisms. The first two mechanisms are the scaled dot-product attention and multi-head attention introduced with the Transformer model [18], and the third consists of a simple recurrent network.

*Multi-head Attention.* The first mechanism uses multi-head attention. Revisiting the attention definitions in [18] gives us:

$$\mathcal{A}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{1}$$

$$\mathcal{M}(Q, K, V) = \text{Concat}(H_1, \dots, H_h)W^O \tag{2}$$

$$H_i = \mathcal{A}(QW_i^Q, KW_i^K, VW_i^V) \tag{3}$$

where *Q*, *K* and *V* are the query, key and value matrices with dimensionalities *d<sub>q</sub>*, *d<sub>k</sub>* and *d<sub>v</sub>*, respectively. $\mathcal{A}$ and $\mathcal{M}$ denote the scaled dot-product attention and the multi-head attention. The multi-head attention mechanism consists of multiple heads *H<sub>i</sub>* that compute the scaled dot-product attention in parallel. Each head has its own *Q*, *K* and *V* projection matrices and produces outputs of dimension *d<sub>v</sub>*/*h*, where *h* is the number of heads. The outputs are then concatenated and projected to *d<sub>v</sub>* via *W<sup>O</sup>*.
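To make Eq. (1) concrete, the following is a minimal sketch of scaled dot-product attention in plain Python on nested lists. It is an illustration under simplifying assumptions (no batching, no masking, no learned projections) and is not the implementation used in the paper; the helper names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def scaled_dot_product_attention(q, k, v):
    """Eq. (1): softmax(Q K^T / sqrt(d_k)) V. Returns (output, attention weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Self-attention on a toy sequence of three 2-dimensional token vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = scaled_dot_product_attention(x, x, x)
print(len(out), len(out[0]))  # 3 2
```

Each row of `weights` sums to one, so every output row is a convex combination of the value vectors. Multi-head attention (Eqs. 2 and 3) applies this function once per head to linearly projected inputs and concatenates the results.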

In the dimensional paradigm we want *d<sub>v</sub>* to be of the same dimension as the original input *d<sub>emb</sub>* to reduce the concatenated dimensions back to the embedding dimension. While in principle the attention mechanism allows rescaling the dimension by choosing *d<sub>v</sub>*, multi-head attention requires that *d<sub>k</sub>* = *d<sub>q</sub>* and that *d<sub>v</sub>* is divisible by the number of heads. This makes rescaling by *d<sub>v</sub>* impractical in our model as we cannot always choose the output dimensions of our experts. For the dimensional paradigm, it is therefore beneficial to follow the general practice of setting *d<sub>k</sub>* = *d<sub>q</sub>* = *d<sub>v</sub>* = *d<sub>emb</sub>* + *d<sub>class</sub>* and to rescale by changing the dimension of *W<sup>O</sup>*.

In the case of the sequential paradigm, we do not want to change the dimension. We calculate the attention on the sequence of length *N* + 1 and drop the last sequence element.

*Scaled Dot-Product Attention.* Setting the number of heads in multi-head attention to *h* = 1 yields the scaled dot-product attention.

*RNN.* The third mechanism consists of a simple bi-directional GRU layer with its concatenated last hidden dimensions equaling the original embedding dimension. The hidden state after the last token in the sequence serves as the new knowledge-infused representation. For the sequential paradigm, we require the RNN to be bi-directional as we have to drop the last hidden state. If the RNN were one-directional, dropping the last hidden state would also drop all the expert information.

## 4 Experiments

We train the downstream dialogue act (DA) classifier model for each combination method and paradigm. The results are shown in Table 2. As baselines, we use the simple classifier and the CASA model, each trained and evaluated with the unaltered GloVe embeddings as input. Additionally, we trained combination mechanism baseline models by removing the expert from the model. The purpose of this is to better understand whether any performance improvement is due to the additional parameters the combination mechanism introduces to the model or to the information of the expert.

Each model was trained until convergence with a patience of 30. The 5 best model iterations with regard to validation accuracy were saved. The results given in Table 2 show the averaged test accuracies.
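The stopping rule and checkpoint selection described above can be sketched as follows. This is a minimal illustration that assumes patience counts consecutive epochs without a new best validation accuracy; the function name `run_with_patience` and the accuracy values are hypothetical, and the (accuracy, epoch) pairs stand in for saved model checkpoints.

```python
import heapq

def run_with_patience(val_accuracies, patience=30, keep_best=5):
    """Return (best checkpoints, stopping epoch) for a stream of validation accuracies."""
    best_acc = float("-inf")
    epochs_without_improvement = 0
    top = []        # min-heap of (accuracy, epoch); stands in for saved checkpoints
    stopped_at = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        stopped_at = epoch
        # Keep only the keep_best highest validation accuracies seen so far.
        if len(top) < keep_best:
            heapq.heappush(top, (acc, epoch))
        elif (acc, epoch) > top[0]:
            heapq.heapreplace(top, (acc, epoch))
        # Patience counts consecutive epochs without a new best accuracy (assumption).
        if acc > best_acc:
            best_acc = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return sorted(top, reverse=True), stopped_at

# Made-up validation accuracies with small settings so the behaviour is easy to follow.
history = [0.50, 0.60, 0.55, 0.54, 0.53, 0.70]
checkpoints, stopped_at = run_with_patience(history, patience=3, keep_best=2)
print(checkpoints, stopped_at)  # [(0.6, 2), (0.55, 3)] 5
```

In this toy run training stops at epoch 5, before the later improvement at epoch 6 is seen, which is why a generous patience such as 30 is a common choice. The reported number would then be the mean test accuracy of the retained checkpoints.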

Fig. 3. Heatmap and attention visualization for the multi-head attention weights in both combination paradigms. The attention weights depicted have been averaged over all heads. Attention visualization created via *BertViz* [19]

#### 4.1 Data

We train all models on the Switchboard Dialogue Act corpus (SwDA) [6,17]<sup>2</sup>. The dataset consists of conversations, each of which contains a sequence of sentences. We follow the train, validation and test splits given in the official paper.

After removing the non-verbal instances from the dataset, the corpus consists of *n<sub>class</sub>* = 41 classes. The class frequency across the whole dataset is significantly imbalanced. To improve training, we calculate the cross-entropy loss with class weights. The class weights are inversely proportional to the frequency of the class.
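As an illustration of inverse-frequency class weighting, consider the following minimal sketch. The normalisation constant (number of samples divided by number of classes), the helper names, and the toy label counts are assumptions made for this example; the text only states that the weights are inversely proportional to class frequency.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

def weighted_cross_entropy(probs, target, weights):
    """Loss for one example; probs maps each class to its predicted probability."""
    return -weights[target] * math.log(probs[target])

# Toy label distribution (illustrative, not the real SwDA class frequencies).
labels = ["sd"] * 6 + ["b"] * 3 + ["qy"]
weights = inverse_frequency_weights(labels)

probs = {"sd": 0.6, "b": 0.3, "qy": 0.1}
loss_common = weighted_cross_entropy(probs, "sd", weights)
loss_rare = weighted_cross_entropy(probs, "qy", weights)
print(weights)
```

A class that occurs six times less often receives a six times larger weight, so errors on rare dialogue acts contribute more to the loss than errors on frequent ones.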

We load the data in conversations. This means that the sentences within a conversation are always presented in the same order, thus retaining their contextual information. During training, we load the conversations in random order.

For the word embedding, we choose the *d<sub>emb</sub>* = 300-dimensional GloVe vectors trained on Wikipedia 2014 + Gigaword: 'glove.6B'.

#### 4.2 Hyperparameters

The hyperparameters used are summarised in Table 1. The combination mechanism models share the same hyperparameters as the simple classifier as the combination mechanism itself is defined by *d<sub>emb</sub>*. The learning rate was kept constant until epoch 50, after which it was scaled by a factor of 1/√*epoch*. For the combination mechanism baseline models the learning rate was kept constant at 0.00001. No hyperparameter tuning was performed. The hyperparameters were chosen to represent standard values used in machine learning. The hyperparameters for the CASA classifier follow [15].


Table 1. List of hyperparameters.
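The learning rate schedule described in this section can be sketched as a small function. This is one possible reading, assuming the factor is applied to the initial learning rate rather than cumulatively; the `base_lr` default is a hypothetical placeholder and not a value taken from Table 1.

```python
import math

def learning_rate(epoch, base_lr=1e-3, hold_epochs=50):
    """Constant learning rate up to hold_epochs, then base_lr / sqrt(epoch).

    base_lr is a placeholder value (assumption), not the value used in the paper.
    """
    if epoch <= hold_epochs:
        return base_lr
    return base_lr / math.sqrt(epoch)

for epoch in (1, 50, 51, 100, 400):
    print(epoch, learning_rate(epoch))
```

Under this reading the learning rate drops sharply at epoch 51 (by a factor of about 7) and then decays slowly, reaching one tenth of the initial value at epoch 100.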

#### 4.3 Results

As shown in Table 2, all combination models show a significant improvement in performance compared to the simple classifier. In addition, the combination models also show a significant improvement when compared to their baseline performance.

<sup>2</sup> This work uses the pre-cleaned dataset files provided in https://github.com/NathanDuran/Switchboard-Corpus.


Table 2. Dialogue act classification accuracies

The simple classifier reaches an accuracy of 69.25%. This low accuracy is expected as we chose a deliberately simple downstream model. We can also observe that the combination baseline models reach similar accuracies to the simple classifier. This supports the conclusion that the significant performance improvement is not an artifact of the additional trainable parameters that the combination mechanism introduces. For the multi-head attention and RNN baselines we only see small improvements in accuracy, while the performance worsens for the scaled dot-product baseline. This suggests that a single application of the scaled dot-product attention might be too simple and has a detrimental effect on the information present in the pre-trained GloVe embeddings.

Nevertheless, when given outputs from an expert, all combination models in both combination paradigms significantly increase the performance and push the accuracy into the regime of the expert of ∼75%. This means that the information present in the expert output is successfully infused into the new representation that we pass on to the downstream model. In the case of the multi-head attention mechanism in the dimensional paradigm, the performance equals the CASA baseline performance of 75.03%. This might indicate that the new representation has incorporated all information from the expert and carried it over to the downstream model so that it reaches equal performance. Whether the performance of the simple downstream model fed with the expert-infused representations can exceed the performance of the expert, or whether the expert baseline represents a performance ceiling for the downstream model, is the subject of future work.

Figure 3 shows the visualization of the multi-head attention weights for an example sentence for both combination paradigms. The weights are visualized as a heatmap and using the *BertViz* visualization tool.

For the dimensional paradigm, the influence of the expert output cannot be made visible by attention as we infuse every token with the expert knowledge. Thus, every token carries the same expert information. Nevertheless, it can be seen that for a question, a significant part of the attention is put on the '?' token as well as the 'you' token. In attention models, we usually see more variation in the weights of single words instead of entire columns. Here, the column-wise pattern means that certain words carry over strongly into all new token representations. We suspect that this behavior is due to using only a single layer in the attention mechanism. Infusing all token representations with the same expert information might emphasize this effect, as the expert information and the original token could combine into a 'universally good' or 'universally bad' representation. Thus, 'universally good' representations carry large weights for all new representations.

The sequential attention weight heatmap does not show such a pronounced column-wise attention pattern. While the heatmap shows the significant influence of the expert output, it offers slightly more variation in weights across distinct words instead of columns (with the exception of the expert output column). This indicates that we have successfully created new word embeddings that have been infused with knowledge by paying attention to the relevant expert token. While the expert token dominates the attention weights, it can be seen that some tokens also pay attention to tokens other than the expert token. This means that the original word embedding also contributes to the new word embedding. Comparing the visualizations of the two paradigms makes the advantage of the sequential paradigm for explainability immediately obvious. While we have to speculate on the effects of the expert on the combination process in the dimensional paradigm, in the sequential paradigm we can immediately see the effect of the expert output through attention itself.

Across the different paradigms, the combination models perform similarly well, and no paradigm or model clearly outperforms the others. The multi-head attention reaches the best performance in the dimensional paradigm with an accuracy of 75.03, which is equal to the expert performance. However, no sensible conclusion or insight can be gained from comparing the combination model accuracies, as the differences between them are negligible. Apart from the sequential paradigm retaining the explainability of attention, no clear preference between the paradigms can be stated with regard to performance.

#### 4.4 Future Potential of Model

We expect that the performances will start to diverge once more sophisticated combination mechanisms are employed. In our exploration, we deliberately limited our models to the simplest possible variants of the presented combination mechanisms. If the performance increase can be seen for the simplest models, it is reasonable to expect that it will also appear for more sophisticated models.

A preference of paradigm might emerge regarding computational cost, as the parameter space scales differently with an increasing number of experts for each paradigm. The dimensional paradigm grows faster in the trainable parameter space due to the query, key and value weight matrices, which grow with increasing expert output dimensions. The sequential paradigm does not affect the query, key and value weight matrices but adds feedforward layers and computation calls for each expert. Nevertheless, this is an additive cost in model size for each expert instead of a multiplicative one. Thus, it can be expected that the sequential paradigm might gain an advantage when combining larger numbers of experts.
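The scaling argument can be made concrete with a rough parameter count. The sketch below is only an illustration under stated assumptions that are not taken from the experiments: in the dimensional paradigm the outputs of all experts are concatenated to every token, so the attention width becomes `d_token + n_experts * d_expert` with square, bias-free query, key and value matrices; in the sequential paradigm the attention width stays at `d_token` and each expert adds one linear layer (with bias) that maps its output to `d_token`. The function names and the example sizes are hypothetical.

```python
def dimensional_params(d_token: int, d_expert: int, n_experts: int) -> int:
    """Q, K and V weights when expert outputs widen every token (assumed concatenation)."""
    width = d_token + n_experts * d_expert
    return 3 * width * width  # three square matrices, no bias


def sequential_params(d_token: int, d_expert: int, n_experts: int) -> int:
    """Q, K and V weights plus one assumed linear projection per expert token."""
    qkv = 3 * d_token * d_token
    projections = n_experts * (d_expert * d_token + d_token)  # weights + bias
    return qkv + projections


# Hypothetical sizes, chosen only to make the growth visible.
for n in (1, 2, 3):
    print(n, dimensional_params(4, 2, n), sequential_params(4, 2, n))
```

Under these assumptions the sequential count grows by a fixed amount per expert, whereas the dimensional count grows quadratically with the total width.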

## 5 Conclusion

We developed a simple ensemble-based architecture that creates knowledge-infused representations by combining the original input with the output of a pre-trained task-specific expert. We tested this infusion process for different combination methods and paradigms. The proof of concept that this architecture is able to create knowledge-infused representations opens up several exciting future research directions. We saw that knowledge-infused representations improved the performance of deliberately simple downstream models. This opens exciting opportunities to simplify the training of new models, as we can use already trained or pre-trained models to improve the performance of simpler models. In a way, this method can be understood as a combination of an ensemble model and a pretraining-finetuning approach.

In future work, we would like to train the downstream model and expert on different tasks to investigate the architecture's true transfer learning capabilities. A natural next step would be to increase the number of experts and explore the architecture's ability to perform multitask learning as well as investigate the scaling behavior of the two different combination paradigms. The exploration of more sophisticated combination models is also of interest. Of particular interest is also the question of whether the performance of this approach is fundamentally capped by the performance of the experts or whether the combination process is able to elevate the performance beyond the experts' baseline performance. In contrast to the proof-of-principle investigation presented in this paper, a next step is a more systematic investigation to achieve the best performance and compare it with other state-of-the-art models.

Overall, the approach of infusing already trained expert knowledge into original pre-trained representations has the potential to offer great benefits to the field of transfer learning. The ability to combine distinct experts into expert sets that have been selected with a specific task in mind could offer great task-specific performance gains.

## References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Cognitive Robotics - Towards the Development of Next-Generation Robotics and Intelligent Systems

Evi Zouganeli<sup>1</sup> and Athanasios Lentzas<sup>1,2(B)</sup>

<sup>1</sup> OsloMet - Oslo Metropolitan University, Oslo, Norway {evizou,nasoslen}@oslomet.no

<sup>2</sup> Aristotle University of Thessaloniki, Thessaloniki, Greece

Abstract. In this paper we make the case for cognitive robotics, which we consider a prerequisite for next-generation systems. We give a brief account of current cognition-enabled systems and viable cognitive architectures, discuss system requirements that are currently not sufficiently addressed, and put forward our position and hypotheses for the development of next-generation, AI-enabled robotics and intelligent systems.

Keywords: Artificial cognition · Robotics · Intelligent systems

## 1 Introduction

Robots, and artificial systems more generally, are gradually evolving towards intelligent machines that can function autonomously in the vicinity of humans and interact directly with humans – e.g. drive our cars, work together with humans, or help us with everyday chores. Current artificial systems are good at performing relatively limited, repetitive, and well-defined tasks under specific conditions; however, anything beyond that requires human supervision. At the moment, it is not quite possible to deploy robots in new environments, broaden the scope of their operation, and allow them to perform diverse tasks autonomously, as systems are not versatile, safe, or reliable enough for that. Pre-programmed and pre-configured robots lack the ability to adapt, learn new tasks, and adjust to new domains, conditions, and missions.

Cognitive robotics is a multidisciplinary research field that has gained increased interest recently as it has become apparent that an advanced system architecture is a prerequisite for progressing from specialized "caged" systems to real-life autonomous systems [10]. Cognition encompasses the mental functions by which knowledge is acquired, retained, and used: perception, learning, memory, and thinking [25]. In humans, it encompasses processes such as judgment and evaluation, reasoning and computation, problem solving and decision making, comprehension, and production of language.

In order to realize such functionality in artificial systems, one needs to define an architecture that describes and governs these processes. Such system architectures are inspired by human cognition. They comprise the necessary modules for taking care of individual processes at many levels and for overall system operation, and they define the way information flows for knowledge acquisition, reasoning, decision making, and detailed task execution. Ideally, a cognitive robot shall be able to abstract goals and tasks, combine and manipulate concepts, synthesize, make new plans, learn new behaviour, and execute complex tasks - abilities that at the moment only humans possess and that lie at the core of human intelligence. Cognitive robots shall be able to interact safely and meaningfully and collaborate effectively with humans. Cognition-enabled robots should be able to infer and predict the human's task intentions and objectives, and provide appropriate assistance without being explicitly asked [24].

In this article we present work in progress, and our approach to cognitive robotics for next-generation systems. Our approach builds on two hypotheses/positions: i) Artificial Intelligence requires a robust cognitive architecture in order to become intelligent enough to be deployed in real-life systems in the vicinity of humans – interacting safely and meaningfully, and collaborating with humans. ii) Artificial cognitive systems need to encompass some of the processes of the right hemisphere of the human brain - such as holistic evaluation, holistic perception, intuition, imagination, and moral evaluation and reasoning.

We elaborate on these in this paper, which is organized as follows. First, we give an account of current cognition-enabled systems in Sect. 2. In Sect. 3 we outline a selection of cognitive architectures, and then proceed to present our approach and positions in Sect. 4. Finally, we conclude in Sect. 5.

## 2 Cognition-Enabled Robotics

Artificial cognitive systems are nowhere near human cognition at the moment; however, isolated narrow-scope cognitive functionality has been implemented in robotic systems to enable their operation. Cognition can be visualized as a pyramid [40] (Fig. 1) that models the flow of sensory input and information to realise cognitive functions and processes. The main cognitive processes are [3]: *Attention*, *Language*, *Learning*, *Memory*, *Perception*, *Thought*, and *Emotion*. Simpler processes, mostly related to behavioral elements closest to the sensory input, are at the base of the pyramid. As we move towards the top of the pyramid, more advanced and complex cognitive processes are found.

Perception is important for cognition as it provides agents with relevant information from their environment. A plethora of sensors are exploited in current systems, ranging from sensors simulating human senses (cameras, microphones etc.) [7,11], to ambient sensors and IoT devices [9]. Beyond simple object recognition, advanced perception attempts to analyze the whole scene and reason on the content of the scene [31]. Scene understanding has been used for knowledge acquisition in ambiguous situations [23].

Language-based cognitive capability has been shown to promote interaction, communication and understanding of abstract concepts [16]. Robots able to express thoughts and actions allow a better cooperation with humans [44]. An agent with the ability to summarize its actions and gain new knowledge has been demonstrated [14].

Fig. 1. Objective pyramid of cognition [40].

Learning is the core function of a cognitive system [34]. Agents can learn from expert demonstration through Imitation Learning [17], an approach that is under development. Transfer Learning is another common approach that also allows training in a simulated or protected environment [22]. Learning is currently closely woven with sensory-motor inputs and outputs, data processing, and perception, hence primarily limited to the lower layers of the cognition pyramid (Fig. 1).

The pinnacle of cognition is thinking, reasoning, decision making, and planning. Reactive architectures are part of higher cognition as they affect the decision and thought process [45]. Planning and decision-making can benefit from cognition-enabled agents. Reasoning on a recognized scene allows a robot to calculate an optimal path by accurately localizing itself, the goal, and obstacles or dangerous areas [30]. Safety rules applied to a robot and the ability to recognize areas of potential hazard promote a safe environment for both the robot and the humans [43]. A holistic approach to thinking, with human-like cognitive reasoning and decision-making processes, is far from realised, and thought processes are relatively basic at the moment.

Social robots can greatly benefit from emotional cognition [16]. Robots with the ability to recognize and express emotions (anthropomorphism) promote an easier and more effective interaction with humans [38], and robots that express empathy have been shown to help humans alter negative feelings to positive ones [5,21].

## 3 Cognitive Architectures

Modeling human cognition has led to the formal definition of cognitive architectures. Although first order logic approaches [20] allowed the gradual refinement of the performed actions, agents continued to lack the ability to merge new information with existing beliefs. This led to the proposal of more complex architectures. A selection of often used cognitive architectures is briefly introduced here (Fig. 2).

Fig. 2. A schematic of ACT-R (a) and KnowRob 2.0 (b) architectures.

A commonly used architecture is ACT-R [2] where knowledge is divided based on the type of information (facts or knowledge on how to do things). Each component is accessed via a dedicated buffer, and the contents of these buffers represent the state of the world. ACT-R is based on productions, i.e. "IF" - "THEN" rules. When the current state of the world matches the precondition (using a pattern matcher module), the rule is triggered executing the relevant action. Productions, when executed, alter the state of the buffers and hence the state of the system.
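The production mechanism can be illustrated with a deliberately small sketch. It is not ACT-R itself: there is no conflict resolution, timing, or declarative memory, and the first matching rule simply fires. The class and function names, the buffer names, and the example rule are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Buffers = Dict[str, str]  # buffer name -> current content (assumed representation)


@dataclass
class Production:
    name: str
    condition: Callable[[Buffers], bool]  # the "IF" part, matched against the buffers
    action: Callable[[Buffers], None]     # the "THEN" part, which alters the buffers


def step(buffers: Buffers, productions: List[Production]) -> Optional[str]:
    """Fire the first production whose condition matches; return its name, or None."""
    for production in productions:
        if production.condition(buffers):
            production.action(buffers)
            return production.name
    return None


def wave(buffers: Buffers) -> None:
    buffers["manual"] = "wave"
    buffers["goal"] = "done"


# Hypothetical rule: IF the goal is to greet and a person is seen THEN wave.
productions = [
    Production(
        name="greet-person",
        condition=lambda b: b["goal"] == "greet" and b["visual"] == "person",
        action=wave,
    )
]

buffers = {"goal": "greet", "visual": "person", "manual": "idle"}
first = step(buffers, productions)   # the rule matches and changes the buffers
second = step(buffers, productions)  # the goal is now "done", so nothing matches
```

After the first step the buffers, and hence the state of the system, have changed, which is why the same rule no longer matches on the second step.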

A more detailed representation of human cognition is attempted by the LIDA (Learning Intelligent Distribution Agent) cognitive architecture [18,19]. LIDA assumes that cognition functions in cycles with distinct phases. The first phase is perception and understanding, allowing the agent to perceive the world and update its understanding of the current state. The next phase is the attention phase, where information is filtered and the conscious content is broadcast, followed by the action and learning phase.

The KnowRob 2.0 architecture [4] is designed specifically for robots, allowing them to perform complex tasks. At the core of the architecture are the ontologies (a subject's properties and relationships) and axioms (rules a priori true). A photorealistic representation of the environment is used for reasoning, allowing the agent to simulate its actions. Actions are stored as episodes allowing recall or knowledge transfer.

Several cognitive architectures can be considered for artificial cognition, and are extensively studied and presented by BICA [1]. In addition to the above architectures, SOAR [26], Icarus [27], and Clarion [39] are often used.

## 4 Position

Artificial cognitive architectures try to imitate human cognition - the epitome of cognitive systems. Some of the cognitive architectures – such as ACT-R, SOAR, LIDA – are primarily an attempt to model human cognition, whereas others – e.g. KnowRob – are inspired by human cognition but aim primarily at an architecture for artificial cognition. Cognitive architectures are progressing and gradually moving closer to human cognition; however, there is still huge uncharted ground, and a long way to go.

Semantic scene understanding and holistic perception have so far been realised only to a very basic extent, merely at a proof-of-concept level, and there is considerable scope for further development in this area.

The importance of language in cognition was identified in early studies. Cognitive structures and capabilities are affected by language [8,37]. Despite the huge advances in speech analysis, translation, and synthesis, language is currently merely incorporated as an input/output interface in robotic systems, and is hardly included in any of the artificial cognitive processes [14,44].

Emotions have only recently been recognized as a part of cognition in humans [28,32,41] as they have previously been considered as innately hardwired into our brains. In LIDA, emotions are expressed as nodes that when triggered lead to experiencing the corresponding emotion. This is important in particular for good interaction between artificial systems and humans [13,38]. However, emotions are not incorporated in the thought process in any of the architectures or implementations, whereas in humans they often play a central role in decision making.

Currently, robots are not explicitly ethical and lack moral judgement. Ethical and moral rules have been used to that end as they can potentially affect both the acceptance of robotic applications and robotic decision making [29,33]. Norm violation may decrease human trust in an agent; therefore, the agent should alter or completely discard a plan if it goes against moral values [6,12]. A fair amount of work has been done on moral reasoning and logic [15,42]. Nevertheless, moral reasoning and evaluation is not yet incorporated in cognitive architectures, nor is it an integral part of a holistic decision process. Although ethics and moral values may not be considered a direct part of cognition, they play an important role in human decision making, govern human behavior, and will be instrumental for developing responsible robots.

Another relatively neglected area is artificial curiosity and imagination. While KnowRob 2.0 implements a basic form of imagination to anticipate outcomes, as robots imagine the effect of their actions in their inner world representation, it is only associated with sensory-motor action and planning. Innate curiosity for exploration, global optimization, and knowledge acquisition is not explicitly accounted for in any of the reviewed architectures. This ability is critical for robots operating autonomously in unknown environments, and will allow them to effectively solve tasks even when their knowledge is not complete, and there is no human to provide the necessary information [35,36].

Moreover, current cognitive systems do not explicitly account for ingenuity. Ingenuity is the ability to employ tools or existing knowledge and use them to solve new problems in new unrelated domains. This will require complex abstraction, and synthesis of knowledge and skills. This ability will enable artificial agents to solve complex problems, and invent good solutions even when they do not have all required knowledge, sufficient experience, or the optimal tools at their disposal.

The human brain comprises two interconnected hemispheres – the left and the right – that have distinct functions and operate in different ways. The left hemisphere stands for linear thinking, detail-oriented perception, facts processing, computations, language processing, planning, and logic. The right hemisphere stands for holistic thinking, holistic perception, intuitive thinking, imagination, creativity, and emotional and moral evaluation. Current models of human cognition are computational in nature and represent primarily the functions of the left hemisphere. The operation and processes of the right hemisphere are far less understood, and they are not explicitly included in the models of human cognition, let alone in robotic systems.

Our approach to attending to the above challenges in order to develop next generation robotics and intelligent systems builds upon two main hypotheses/positions:


## 5 Summary

In this paper we have made the case for cognitive robotics and presented our approach to next generation advanced systems. We have given an overview of human cognition, an account of cognition-enabled systems and the state of the art, and a brief outline of a selection of cognitive architectures that can lend themselves to artificial cognition. The validity of our approach remains to be demonstrated. Artificial cognitive systems are emerging, and currently at a rather early stage of development. In our opinion, they are the cornerstone towards next generation advanced robotics, the key to unlocking the potential of robots and artificial intelligence, and enabling their use in real-life applications.

## References



# **Pattern Based Software Architecture for Predictive Maintenance**

Ants Torim<sup>(B)</sup>, Innar Liiv, Chahinez Ounoughi, and Sadok Ben Yahia

Tallinn University of Technology, Ehitajate tee 5, 19086 Tallinn, Estonia ants.torim@taltech.ee

**Abstract.** Many industrial sectors are moving toward Industry Revolution (IR) 4.0. In this respect, the Internet of Things and predictive maintenance are considered the key pillars of IR 4.0. Predictive maintenance is one of the hottest trends in manufacturing, where maintenance work is carried out according to continuous monitoring using a health check of processing equipment or instrumentation. It gives the maintenance team advance prediction of failures and allows the team to take timely corrective actions and decisions. The aim of this paper is to present a smart monitoring and diagnostics system as an expert system that can alert an operator before equipment failures to prevent material and environmental damage. The main novelty and contribution of this paper is a flexible architecture of the predictive maintenance system, based on software patterns - flexible solutions to general problems. The presented conceptual model enables the integration of expert knowledge of anticipated failures with anomaly detection based on the matrix profile technique. The results so far are encouraging.

**Keywords:** Predictive maintenance · Conceptual architecture · Analysis patterns · Machine learning · Time series analysis · Matrix profile

## **1 Introduction**

The increasing capabilities of data collection mechanisms have given rise to new intelligent solutions for decision-making. The burgeoning advancement in Machine Learning (ML) algorithms has yielded a tangible impact on decision-making techniques. In addition, adopting efficient management systems for maintenance work can decrease the unpredicted costs during equipment failures and shutdown periods.

Indeed, industrial equipment failure can be costly or even endanger personal safety. Therefore, moving from simple schedule-based maintenance to smart sensor-based predictive maintenance systems has become increasingly popular. However, these systems are not simple, and there are several approaches to them, including those based on expert knowledge and those based on machine learning. We propose a conceptual architecture that combines these using known solutions (patterns) from the field of software engineering. The proposed architecture is based on a real-world but anonymous industry implementation of predictive maintenance; nevertheless, the approach, observations and discussions presented here are of general interest for anyone designing practical predictive maintenance systems.

The remainder of this paper is organized as follows. In Sect. 2, we present an overview of the related work about predictive maintenance models. In Sect. 3, we discuss the methodological contribution of our work compared to the related work and their practical applications in real-life scenarios. Next, in Sect. 3.1, we briefly present the Matrix Profile Method for Predictive Maintenance and discuss its application from the perspective of our case study. Sect. 4 describes the conceptual pattern-based architecture for predictive maintenance using UML class diagrams. Finally, the conclusion and directions for future work are given in Sect. 5.

## **2 Related Work**

Hashemian [6] identifies eight applications of equipment condition monitoring: process optimization, personal safety, equipment health, emission monitoring, process diagnostics, equipment performance, leak detection and calibration verification. Predictive maintenance has been a field of active study [3,5,9].

In their literature review on predictive maintenance, Ran et al. [14] emphasize the importance of maintenance in industry, the damage caused by unplanned downtime, and the capability of emerging technologies to make predictive maintenance widely accessible. They divide the approaches to predictive maintenance into three:


For machine learning, we make heavy use of the Matrix Profile method invented by Eamonn Keogh and Abdullah Mueen [16], which we describe later. Our implementation is built on the Python stumpy Matrix Profile library [8].

Our architecture aims to combine the rule-based approach with traditional or deep learning methods. To achieve that, we propose an architecture based on archetype and analysis patterns from software engineering. Analysis patterns were described by M. Fowler [4] as groups of concepts that represent a common construction in business modeling that may span many domains. Business archetype patterns (namely Product, Party, Order, Inventory, Quantity, and Rule), originally introduced by Arlow and Neustadt [2], are the universal information models describing the universe of discourse of businesses [13]. The archetypes and archetype patterns were further explored in the works of Piho et al. [11,12] as part of the Sentry (sample entry) software for CBPG (Clinical and Biomedical Proteomics Group, Leeds Institute of Cancer and Pathology, University of Leeds). Interestingly, the Rule pattern that is fundamental for our architecture is described in these works as covering all the fundamental business requirements - What (Things), How (Processes), Where (Locations), Who (Persons), When (Events), Why (Strategies) - of the Zachman Framework [17]. To the best of our knowledge, the application of these patterns to the field of predictive maintenance has not been explored before. We implemented our prototype system based on this architecture using the Grafana observability platform<sup>1</sup>.

## **3 Matrix Profile Method for Predictive Maintenance**

We make heavy use of the Matrix Profile (MP) method in our predictive maintenance system. While our architecture is general and can work with different methods, MP is an effective method and an excellent example for exploring the issues of time series analysis.

Time series analysis typically looks at anomalies and motifs/patterns. The basic idea of MP is to compute the distance between each pair of sub-sequences in the time series. An important parameter for MP is the window size, i.e. the length of the sub-sequences. Despite the simplicity of a naive algorithm based on nested loops, it can take months or years to get an answer for a moderately sized time series using this method. However, using the Matrix Profile algorithms, computation time is significantly reduced. The matrix profile is a relatively new time series analysis data structure invented by Eamonn Keogh [10] at the University of California Riverside, and Abdullah Mueen at the University of New Mexico [16]. A matrix profile is a vector that stores the z-normalized Euclidean distance between any sub-sequence within a time series and its nearest neighbor. The approach is agnostic to domains, fast, supplies an exact solution, and only requires one parameter (window size).
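The naive nested-loop computation can be sketched in a few lines of plain Python. This is a minimal illustration for small inputs, not the optimized algorithm used by libraries such as stumpy. The function names and the synthetic series are hypothetical, and the exclusion zone that skips trivially overlapping neighbors (here a quarter of the window size, at least one sample) is an assumption of this sketch.

```python
import math
from typing import List


def znorm(window: List[float]) -> List[float]:
    """Z-normalize one sub-sequence (zero mean, unit standard deviation)."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)  # constant window: avoid division by zero
    return [(x - mean) / std for x in window]


def naive_matrix_profile(series: List[float], m: int) -> List[float]:
    """Distance from each length-m sub-sequence to its nearest non-trivial neighbor."""
    n_sub = len(series) - m + 1
    windows = [znorm(series[i:i + m]) for i in range(n_sub)]
    exclusion = max(1, m // 4)  # assumed exclusion zone around each position
    profile = [math.inf] * n_sub
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) <= exclusion:
                continue  # skip the window itself and its close overlaps
            profile[i] = min(profile[i], math.dist(windows[i], windows[j]))
    return profile


# Hypothetical data: a strictly periodic signal, and a copy with one injected spike.
clean = [0.0, 1.0, 0.0, -1.0] * 8
spiked = list(clean)
spiked[14] = 5.0

clean_profile = naive_matrix_profile(clean, m=4)
spiked_profile = naive_matrix_profile(spiked, m=4)
worst = spiked_profile.index(max(spiked_profile))  # start of the most unusual window
```

On the periodic series every sub-sequence has an exact repeat, so all profile values are zero; in the modified series the largest value belongs to a window that covers the injected spike. The two nested loops over all sub-sequences make the quadratic cost of the naive approach visible.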

In Fig. 1 we show at the top a synthetic time series from a temperature sensor, which contains two sawtooth patterns and added noise. Temperature is standardized to a mean of zero and a standard deviation of one. At the bottom we show the corresponding Matrix Profile values for sub-sequences of length 640. MP values correspond to the distance to the nearest (most similar) sub-sequence, and therefore low MP values show repeating motifs. As we can see, these identify our two sawtooth patterns. Anomalies can correspondingly be identified by large MP values, as shown in Fig. 2.

Multi-dimensional MP [7] generalizes finding motifs and anomalies to multidimensional cases. Because of its computational complexity, this algorithm [15] requires careful design to keep the additional running time low. In addition, although it may be tempting to look for motifs in all available dimensions (i.e., the motif must exist in all dimensions and co-occur), it has been shown that this rarely produces meaningful motifs except in the most contrived of cases [7]. An alternative strategy would be to reduce time-series dimensions to a subset of "useful" dimensions before assigning a sub-sequence as a motif.

#### **3.1 Matrix Profile in Our Case Study**

Our goal is to analyze the collected sensor data for failures in the equipment alongside detecting sensor anomalies. In our case study, we analyzed the available sensor data and the correlations, adopted the MP algorithm for detecting anomalies and patterns in time-series data and extracted potential rules about sensor anomalies and device failures.

**Fig. 1.** Matrix profile motif detection

**Fig. 2.** Matrix profile anomaly detection

<sup>1</sup> https://grafana.com/grafana/.

Our training dataset contained 169,489 rows × 17 columns (sensors) collected over seven months. At the moment we have no legal rights to publish actual data from the project.

A Pearson correlation study demonstrated that the observations of sensors of the same type were highly correlated. As a result, measurements from one sensor could be used to represent all the others of the same kind. If one sensor goes down, the data from the others are still sufficient since their behavior is correlated.
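The idea of letting one sensor stand in for its highly correlated peers can be sketched as follows. The sensor names, the readings, the greedy selection strategy, and the threshold of 0.95 are all hypothetical; they only illustrate the principle and are not taken from the case study data.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / (norm_x * norm_y)


def pick_representatives(readings: Dict[str, List[float]],
                         threshold: float = 0.95) -> List[str]:
    """Greedily keep a sensor only if no already kept sensor is highly correlated with it."""
    kept: List[str] = []
    for name, values in readings.items():
        if all(abs(pearson(values, readings[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical readings: two temperature sensors that move together, one pressure sensor.
readings = {
    "temp_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temp_b": [1.1, 2.0, 3.2, 3.9, 5.1],
    "pressure": [5.0, 1.0, 4.0, 2.0, 3.0],
}
representatives = pick_representatives(readings)
```

Here `temp_b` is dropped because it is almost perfectly correlated with `temp_a`, while `pressure` is kept as a separate representative.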

MP is a visual method and our prototype supports visual explanations through a Grafana<sup>2</sup> dashboard. We use both single- and multi-dimensional MP. We note briefly that selecting the window size is crucial for MP: we get fewer motifs with longer windows and more motifs with shorter ones. In addition, we notice anomalies at different points in the time series.

<sup>2</sup> https://grafana.com/grafana/.

## **4 General Architecture and Patterns**

We describe the conceptual pattern-based architecture for predictive maintenance using UML class diagrams. We base this on known analysis patterns [4] and architectural patterns [2].

#### **4.1 Domain Modeling, UML Class Diagrams and Patterns**

First, we introduce the main ideas of domain modeling. A domain is any field that computation can be applied to [11]. Domain (conceptual) modeling aims to identify, detect and define the concepts from the domain. The latter are usually, but not always, mirrored within the computational system. For example, in a healthcare system, we may identify concepts like Patient, Observation, Protocol, etc. [4]. In the field of trading, we can identify concepts like Derivative Contract, Contract, Portfolio, Trading Package, etc. [4]. Potential concepts can be at a higher or lower level of abstraction (Measurement vs. Blood Pressure Measurement) and be more or less relevant for certain goals (Blood Pressure Measurement is not relevant for the goals of Trading but relevant for Healthcare). Models are not right or wrong, but more or less valuable [4]. However, models that describe confusing concepts or are not general enough can make extending, scaling, and changing a system a nightmare. General models that are used in many different domains and are named and described in the literature are called patterns. Patterns are meant to help with describing and proliferating good design practices [4]. They are widely used in software engineering, and with the rapidly increasing complexity of AI systems like predictive maintenance, they should become more relevant.

**Fig. 3.** Class diagram UML notation [1]

A visual way to describe domain concepts is through class diagrams of the Unified Modeling Language (UML) [1]. Class diagrams can describe either domain concepts or software classes. We provide a short explanation of this notation in Fig. 3.

#### **4.2 Patterns for Predictive Maintenance**

A system of predictive maintenance should ideally combine expert knowledge with the power of AI systems like MP to detect outliers and anomalies. Our aim is shared with the field of software engineering: to protect the system from variations arising from changing user requirements. We do not want to change, reprogram and redeploy our system when the user's tolerance for false positives and false negatives changes, when the user decides that a certain indicator is no longer useful, etc. Therefore, we introduce a rule interpreter that allows the user/expert to add, change and delete various alert rules, controlling the behaviour of our system. We also note that these rules could apply both to actual sensor observations and to machine learning anomaly/motif estimates derived from those observations. Therefore, we generalize both of those into Indicators. Our general split between expert-defined rules and the various indicators that these rules apply to was present before we made the connection to the patterns from the field of software engineering, but this connection has enabled us to clarify and generalize our approach. As is common with patterns, we have modified them to suit the needs of our specific field of predictive maintenance.

#### **4.3 Observation and Indicator Patterns**

The Observation pattern (Fig. 4) was originally described by M. Fowler as follows [4]:

Observation observes some actual parameter that is either a quantity or category. Observation can, of course, be mistaken. Here we reproduce only the basic pattern which is described and extended in Fowler's book [4]. Fowler provides various extensions for the fields of Medicine and Corporate Finance.

In the field of predictive maintenance, observations come from the sensors. Experts may want to use them directly in alert rules, applying their expert knowledge to complement machine learning estimates.

**Fig. 4.** Martin Fowler's Observation pattern [4]

We generalize observations from sensors and machine learning estimates into Indicators (Fig. 5) used for predictive maintenance analysis and alerts. Both estimates and observations are indicators.

An observation is a qualitative or quantitative statement about some measured phenomenon, whereas an estimate is a qualitative or quantitative statement calculated from the observations according to some method. For example, a temperature reading from a sensor is an observation, while the predicted likelihood of component failure is an estimate.

**Fig. 5.** Indicator pattern that extends on M. Fowler's Observation pattern [4].
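The generalization described above can be sketched as a small class hierarchy. This is only an illustrative sketch: the field names (`phenomenon`, `sensor_id`, `method`, `based_on`) and the example values are assumptions made for the example and do not come from the figures.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Union

Value = Union[float, str]  # a quantity or a category


@dataclass(frozen=True)
class Indicator:
    """Common supertype: a qualitative or quantitative statement."""
    phenomenon: str      # hypothetical field, e.g. "temperature"
    value: Value
    timestamp: datetime


@dataclass(frozen=True)
class Observation(Indicator):
    """A statement about a measured phenomenon, here from a sensor."""
    sensor_id: str = "unknown"


@dataclass(frozen=True)
class Estimate(Indicator):
    """A statement calculated from observations by some method."""
    method: str = "unspecified"
    based_on: tuple = ()  # the observations the estimate was derived from


now = datetime(2022, 5, 31, 12, 0)
reading = Observation("temperature", 47.5, now, sensor_id="pump-1")
risk = Estimate("failure_likelihood", 0.12, now,
                method="hypothetical-model", based_on=(reading,))

# Alert rules can treat both kinds uniformly as indicators.
indicators = [reading, risk]
```

Because both subclasses share the `Indicator` supertype, code that consumes indicators does not need to know whether a value was measured or computed.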

#### **4.4 Time Series Estimates Using Matrix Profile**

We make heavy use of Matrix Profile and Multi-Matrix Profile estimates. They have to fit into our system of patterns, which they do easily through specialization. Both are time series estimates based on a time series of observations. Both are Anomaly or Motif estimates that indicate whether a particular sub-sequence of observations is a typical motif or a rare anomaly. Matrix Profile is based on a single class of observations, and Multi-Matrix Profile is based on several classes of observations. Estimating anomalies and motifs from a time series is an exceedingly useful specialization of the general estimate for predictive maintenance.
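To make the motif/anomaly reading of a single-dimensional Matrix Profile concrete, the following is a deliberately naive sketch in plain Python. It is not the implementation used in the prototype, which relies on the stumpy library [8]; the exclusion-zone width and the toy series are assumptions chosen for illustration.

```python
import math


def znorm(seq):
    """Z-normalize a subsequence; a constant subsequence maps to zeros."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / len(seq))
    if std == 0:
        return [0.0] * len(seq)
    return [(x - mean) / std for x in seq]


def matrix_profile(series, m):
    """Naive O(n^2 * m) matrix profile for window size m.

    profile[i] is the z-normalized Euclidean distance between the
    subsequence starting at i and its nearest non-trivial neighbour.
    """
    n_sub = len(series) - m + 1
    subs = [znorm(series[i:i + m]) for i in range(n_sub)]
    exclusion = max(1, m // 2)  # assumed width of the trivial-match zone
    profile = []
    for i in range(n_sub):
        best = math.inf
        for j in range(n_sub):
            if abs(i - j) <= exclusion:
                continue  # ignore overlapping neighbours of i
            best = min(best, math.dist(subs[i], subs[j]))
        profile.append(best)
    return profile


# Toy series: a repeating pattern with one distorted cycle in the middle.
pattern = [0.0, 1.0, 2.0, 1.0]
series = pattern * 3 + [0.0, 1.0, -3.0, 1.0] + pattern * 3
profile = matrix_profile(series, m=4)

# Low values mark motifs (repeated shapes); high values mark anomalies.
motif_start = min(range(len(profile)), key=profile.__getitem__)
anomaly_start = max(range(len(profile)), key=profile.__getitem__)
```

In this toy series, every window that avoids the distorted sample has an exact repeat elsewhere, so its profile value is zero, while the windows covering the distorted sample have no close neighbour and stand out.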

#### **4.5 Methods for Calculating Estimates**

In Sect. 4.3 we mentioned a general Method that calculates Estimates from Observations. Here we introduce two specific and important subclasses of these methods. Our method for computing estimates may be realized through Queries, Pipelines, or a mix of these.

**Fig. 6.** Time series estimates and Matrix Profile [16]

**Fig. 7.** Query and Pipeline methods for computing estimates.

Query methods (Fig. 7) are based on SQL-like queries from the observation database. They can also be implemented through systems like Grafana<sup>3</sup>.

Pipeline methods (Fig. 7) transform the data through a series of transformers and finally calculate the estimate on the transformed data using a predictor. This pipeline pattern is common in machine learning; for example, it is used in scikit-learn<sup>4</sup>, a popular Python machine learning library.

<sup>3</sup> https://grafana.com/grafana/.

<sup>4</sup> https://scikit-learn.org/stable/.
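The two kinds of methods can be sketched as follows. This is a minimal illustration in plain Python rather than SQL, Grafana, or scikit-learn; the class names `QueryMethod` and `PipelineMethod`, their constructor arguments, and the sample readings are assumptions made for the example.

```python
from typing import Callable, List, Sequence

Transformer = Callable[[List[float]], List[float]]
Predictor = Callable[[List[float]], float]


class QueryMethod:
    """Computes an estimate by filtering and aggregating observations,
    in the spirit of a SELECT ... WHERE ... aggregate query."""

    def __init__(self, predicate: Callable[[float], bool],
                 aggregate: Callable[[List[float]], float]):
        self.predicate = predicate
        self.aggregate = aggregate

    def estimate(self, observations: Sequence[float]) -> float:
        return self.aggregate([o for o in observations if self.predicate(o)])


class PipelineMethod:
    """Applies a series of transformers, then a final predictor."""

    def __init__(self, transformers: List[Transformer], predictor: Predictor):
        self.transformers = transformers
        self.predictor = predictor

    def estimate(self, observations: Sequence[float]) -> float:
        data = list(observations)
        for transform in self.transformers:
            data = transform(data)
        return self.predictor(data)


def differences(xs: List[float]) -> List[float]:
    """Change between consecutive readings."""
    return [b - a for a, b in zip(xs, xs[1:])]


def magnitudes(xs: List[float]) -> List[float]:
    return [abs(x) for x in xs]


readings = [40.0, 41.0, 39.0, 55.0, 42.0]

# Query-style estimate: how many readings exceed 40?
high_count = QueryMethod(lambda x: x > 40.0, len).estimate(readings)

# Pipeline-style estimate: the largest jump between consecutive readings.
largest_jump = PipelineMethod([differences, magnitudes], max).estimate(readings)
```

Both classes expose the same `estimate` operation, so a mix of queries and pipelines can be used interchangeably wherever a Method is expected.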

#### **4.6 Rules and Alert Rules**

We have already mentioned the need to allow experts to inject their knowledge into the system of predictive maintenance. Our solution is based on a general rule pattern that is well known in the field of software engineering. The Rule pattern was originally described by Arlow and Neustadt [2] and has been used in various systems, including laboratory management [11]. This pattern is central to our conceptual architecture.

**Fig. 8.** Rule archetype pattern from Arlow and Neustadt [2].

As shown in Fig. 8, the original Rule is a sequence of RuleElements that can be Variables or Operators. Variables have values for a specific RuleContext, which could describe a date, a time, and an industrial site. If needed, a Party (e.g., an Expert) could override a Rule using a RuleOverride.
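One way to read this structure is as a sequence that is evaluated against a context. The sketch below assumes, purely for illustration, that the elements are stored in postfix order and evaluated with a stack; the operator set, the `evaluate_with_override` helper, and the variable names are likewise assumptions and are not taken from Fig. 8.

```python
from dataclasses import dataclass
import operator
from typing import Dict, List, Union


@dataclass(frozen=True)
class Variable:
    """A named value that is looked up in the RuleContext."""
    name: str


@dataclass(frozen=True)
class Operator:
    """An operator that combines the two most recent values."""
    symbol: str


RuleElement = Union[Variable, Operator]
RuleContext = Dict[str, object]  # variable name -> value in this context

OPERATIONS = {
    "AND": lambda a, b: bool(a) and bool(b),
    "OR": lambda a, b: bool(a) or bool(b),
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def evaluate(rule: List[RuleElement], context: RuleContext):
    """Evaluate a rule given as a postfix sequence of elements."""
    stack = []
    for element in rule:
        if isinstance(element, Variable):
            stack.append(context[element.name])
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(OPERATIONS[element.symbol](left, right))
    if len(stack) != 1:
        raise ValueError("malformed rule")
    return stack[0]


def evaluate_with_override(rule, context, override=None):
    """A party's override, when present, replaces the computed result."""
    return evaluate(rule, context) if override is None else override


# "temperature > limit AND site_active", written in postfix order.
rule = [Variable("temperature"), Variable("limit"), Operator(">"),
        Variable("site_active"), Operator("AND")]
context = {"temperature": 55.0, "limit": 50.0, "site_active": True}
result = evaluate(rule, context)
```

Because the rule is plain data, it can be stored, edited, and re-evaluated for a different context without changing the interpreter.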

Figure 9 describes our proposed Alert Rule pattern. We replace the Rule from Arlow and Neustadt [2] with an Alert Rule. A Rule Element sets a threshold value for a particular Indicator. Indicators replace the Variables in the original pattern. For example, if a temperature observation is higher than 50 °C or if a Matrix Profile value is lower than 10, then the rule applies.

**Fig. 9.** Alert Rules are based on a rule pattern [2].
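The example rule from the text (temperature above 50 °C or Matrix Profile value below 10) can be sketched as data that an interpreter evaluates. The class layout, the `fires` method, the indicator names, and the choice to combine elements with a logical OR are assumptions made for this illustration; Fig. 9 may organize the pattern differently.

```python
from dataclasses import dataclass
import operator
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RuleElement:
    """A threshold on one named Indicator."""
    indicator: str                            # e.g. "temperature"
    compare: Callable[[float, float], bool]   # e.g. operator.gt
    threshold: float

    def applies(self, indicators: Dict[str, float]) -> bool:
        value = indicators.get(self.indicator)
        return value is not None and self.compare(value, self.threshold)


@dataclass
class AlertRule:
    """An alert fires when any of its elements applies (assumed OR)."""
    name: str
    elements: List[RuleElement]

    def fires(self, indicators: Dict[str, float]) -> bool:
        return any(e.applies(indicators) for e in self.elements)


rule = AlertRule("overheating-or-anomaly", [
    RuleElement("temperature", operator.gt, 50.0),    # observation
    RuleElement("matrix_profile", operator.lt, 10.0), # estimate
])

hot = rule.fires({"temperature": 55.0, "matrix_profile": 30.0})
odd = rule.fires({"temperature": 45.0, "matrix_profile": 5.0})
quiet = rule.fires({"temperature": 45.0, "matrix_profile": 30.0})

# An expert can tighten the rule at run time without redeploying code.
rule.elements[0] = RuleElement("temperature", operator.gt, 40.0)
stricter = rule.fires({"temperature": 45.0, "matrix_profile": 30.0})
```

The last two lines illustrate the motivation given in Sect. 4.2: changing a user's tolerance means editing a rule element, not reprogramming the system.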

Our conceptual architecture based on Indicators, Matrix Profile estimators, and Alert Rules should provide a pattern-based architecture that is flexible and extensible.

## **5 Conclusions**

The article proposes a conceptual architecture that splits the system into a rule/expert system layer and an indicator/machine learning layer. This architecture is based on known patterns from the field of software engineering. It establishes expert rules on indicators from sensors and machine learning methods like Matrix Profile and allows for independent, flexible evolution of the rule management and machine learning subsystems. This paper also presented one potential set of technologies to implement such a conceptual architecture, using the Grafana open observability platform and Python's stumpy framework [8] for Matrix Profile. To the best of our knowledge, the application of the Rule and Observation patterns in our predictive maintenance architecture is a novel contribution.

We have already started building a scalable near-real-time platform to detect underperformance and to trigger proactive maintenance activities. For that purpose, we use state-of-the-art techniques from the machine/deep learning and big data fields.

The technical impact will be manifested through increased innovation facilitated by new AI solutions. The architecture of the near-real-time management tool offers some distinctive features:


#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **AI in Cyber and Digital Sphere**

# An Overview of Artificial Intelligence Used in Malware

Lothar Fritsch(B) , Aws Jaber , and Anis Yazidi

Department of Information Technology, Faculty of Technology, Art and Design, Oslo Metropolitan University, Oslo, Norway {lotharfr,awsalzar,anisy}@oslomet.no https://www.oslomet.no

Abstract. Artificial intelligence (AI) and machine learning (ML) methods are increasingly adopted in cyberattacks. AI supports the establishment of covert channels, as well as the obfuscation of malware. Additionally, AI results in new forms of phishing attacks and enables hard-to-detect cyber-physical sabotage. Malware creators increasingly deploy AI and ML methods to improve their attacks' capabilities. Defenders must therefore expect unconventional malware with new, sophisticated and changing features and functions. AI's potential for automation of complex tasks serves as a challenge in the face of defensive deployment of anti-malware AI techniques. This article summarizes the state of the art in AI-enhanced malware and the evasion and attack techniques it uses against AI-supported defensive systems. Our findings include articles describing targeted attacks against AI detection functions, advanced payload obfuscation techniques, evasion of network communication detection with AI methods, malware for unsupervised-learning-based cyber-physical sabotage, decentralized botnet control using swarm intelligence, and the concealment of malware payloads within neural networks that fulfill other purposes.

Keywords: Information security *·* Artificial intelligence *·* Malware *·* Steganography *·* Covert channels *·* Machine learning *·* Adverse artificial intelligence

## 1 Introduction

In recent years, AI has been increasingly adopted as part of cyber attack methods. The application of AI on the defender's side has been successfully used in intrusion detection systems and is widely deployed in network filtering, phishing protection, and botnet control. However, the enhancement of the capabilities of malware with the help of AI methods is a relatively recent development.

This article presents the result of a literature survey mapping the state of AI-powered malware. The salient aims of this survey are to map AI-enhanced attacks carried out by malware, to identify malware types that conceal themselves from detection using AI techniques, to get a better understanding of the maturity of those attacks, and to identify the algorithms and methods involved in those attacks (Fig. 1 and Table 1).

Fig. 1. Uses of AI in malware.



## 2 Literature Review on AI-Powered Malware

#### 2.1 Literature Search

To assess the state of the art in AI-supported malware, we performed a literature search using the Google Scholar database of scientific publications. We defined the search criteria as follows. Search keywords were *malware, artificial intelligence, machine learning* combined with *offensive, adversarial, attack, network security, information security*. The resulting articles were checked against inclusion criteria. The resulting article set was then snowballed backward and forward [36]. We limited the backward snowballing range by cutting off snowballing for articles older than 2010. Eligible forms of publications were *scientific articles, conference presentations, pre-prints and technical reports*. For inclusion, articles needed to contain *descriptions of malware functionality based on machine learning or AI functionality*. Both *survey articles* and *articles describing demonstrators or specific malware* were included. Our final set comprised 37 articles.
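The search protocol can be written down in a reproducible form, for example as follows. The text does not state how the two keyword groups were combined, so pairing every topic term with every qualifier is an assumption, as are the function names and the quoted query format.

```python
from itertools import product

TOPIC_TERMS = ["malware", "artificial intelligence", "machine learning"]
QUALIFIER_TERMS = ["offensive", "adversarial", "attack",
                   "network security", "information security"]


def build_queries(topics, qualifiers):
    """Pair every topic term with every qualifier (assumed combination)."""
    return [f'"{t}" "{q}"' for t, q in product(topics, qualifiers)]


def within_snowball_range(publication_year, cutoff=2010):
    """Backward snowballing stops at articles older than the cutoff year."""
    return publication_year >= cutoff


queries = build_queries(TOPIC_TERMS, QUALIFIER_TERMS)
candidate_years = [2008, 2010, 2019]
followed = [y for y in candidate_years if within_snowball_range(y)]
```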

After collecting the articles, we classified the articles into categories reflecting the specific malware functionality enhanced with AI techniques. Our findings are summarized below.

#### 2.2 Findings

Among the deployed technologies are authentication factor extraction, generation of phishing and malware domain names, adaptive generation of phishing e-mail, direct attacks against malware detection (code obfuscation, model poisoning) and intrusion detection (generative traffic imitation as well as AI model poisoning attacks). In addition, we found publications describing the successful parsing and controlling of graphical application user interfaces (GUIs). Finally, self-learning malware aimed at sabotage of or through cyber-physical systems was found. In particular, the evasion of detection of malware and the exfiltration of information through covert channels have been recently used in AI-powered malware.

The establishment of covert channels is an established practice for malware distribution, command and control of malware agents, and information exfiltration. Such covert channels intend to bypass intrusion detection, malware detection, and anomaly detection systems.

#### 2.3 Surveys

Our search found 13 survey articles that either fully or partially present knowledge about AI-enhanced malware (see Table 2). We found ten surveys, two taxonomic articles, and one anecdotal collection of AI attack use cases.

The surveys focus on different perspectives of the offensive use of AI against information security in malware:



Table 2. Surveys and taxonomies

#### 2.4 AI-Enabled Attacks on Authentication Factors

Four articles described attacks against authentication factors on mobile devices. The devices' sensors (microphone, accelerometer) were used in combination with AI models with the intention of extracting PINs, passwords, and patterns. The articles are listed in Table 3. We found two categories of AI weaponization against authentication factors:



Table 3. Password extraction or prediction

#### 2.5 Techniques for Hiding Malware Code from Detection

AI is frequently used for hiding malware code from detection. The eleven articles listed in Table 4 show these approaches:



Table 4. Code detection evasion

#### 2.6 Evading Network Traffic Detection

Hiding malware's communication traffic is addressed in four articles (see Table 5). AI, and specifically unsupervised learning, is deployed against intrusion detection systems. Demonstrators described in the articles hide probing and infiltration traffic as well as command and control traffic. One noteworthy article deploys swarm intelligence in order to coordinate botnet agents without a centralized command server.


Table 5. Evasion of network intrusion detection

#### 2.7 Other AI Deployment

Table 6 lists the miscellaneous applications of AI in the malware context. We found six articles describing enhanced capabilities in the areas of phishing, application control and sabotage. AI is used for creating phishing domain names that evade detection in anti-phishing systems. One spear phishing demonstrator extracts social media sentiments using AI in order to turn them into phishing e-mail text, learning which topics are currently most likely to provoke a reaction from the targets.

An interesting application of image recognition is malware that can understand graphical user interface elements with AI with the goal of finding out which GUI elements it can control to execute functionality.

Finally, undetectable sabotage in cyber-physical systems has been demonstrated in two cases: i) a surgical robot that, once injected with malware, can learn to modify its actions so that they resemble normal actions while hurting patients; ii) an AI that can learn to manipulate smart house technology in ways that will be hard to notice. Such AI-empowered sabotage is envisioned to be used against variable targets, dramatically leveraging the preparation effort of cyber sabotage.

## 3 Discussion of Findings

The presented survey investigated the use of artificial intelligence (AI) techniques and of machine learning (ML) for the improvement of malware capabilities. We found surveys and literature that describe a variety of deployments of AI in the malware context:


Table 6. Miscellaneous AI applications in malware


We conclude that AI deployed to either improve or hide malware poses a considerable threat to malware detection. Code obfuscation, code behavior adaptation, and learned evasion of communication detection can potentially bypass existing malware detection techniques.

Offensive deployment of AI within malware improves malware performance, including methods such as selection of targets, extraction of authentication factors, automated and fast generation of highly efficient phishing messages, and swarm-coordinated action planning.

We consider AI-enhanced malware to be a serious risk for information security, which should be thoroughly investigated.

Acknowledgements. The work leading to this article was partially sponsored by OsloMET's AI Lab.

## References



# Fake News Detection by Weakly Supervised Learning Based on Content Features

Özlem Özgöbek<sup>1(B)</sup>, Benjamin Kille<sup>1</sup>, Anja Rosvold From<sup>2</sup>, and Ingvild Unander Netland<sup>3</sup>

<sup>1</sup> Norwegian University of Science and Technology, Trondheim, Norway {ozlem.ozgobek,benjamin.u.kille}@ntnu.no <sup>2</sup> Bredvid AS, Oslo, Norway

<sup>3</sup> Kantega AS, Oslo, Norway

Abstract. Fake news, defined as the publication of false information, either unintentional or with the intent to deceive or harm, is one of the important issues that significantly affect today's digital society. All around the world, journalists and fact-checking organizations are trying to fight this problem manually. However, fighting fake news is a time-sensitive task. Once leaked, fake news spreads fast and its impact on society increases. Because of the complex and dynamic nature of news, applying artificial intelligence methods to the automatic detection of fake news is a challenging task. This work explores the use of weakly supervised learning for fake news detection by using only the content of news articles. This is particularly important when contextual information is not available or is difficult to obtain quickly. To our knowledge, this is the first work that uses a content-based approach in weakly supervised learning without any contextual information for fake news detection. We propose an architecture that generates weak labels. We explore the effect of using weak labels for fake news detection with five different machine learning models. We demonstrate that weakly supervised learning is an effective approach to the automated detection of fake news in the absence of high-quality labels.

Keywords: Fake news detection *·* Disinformation *·* Weakly supervised learning *·* Content features

## 1 Introduction

The spread of fake news is not a new problem. However, with the advancement of the internet and social media, it has become a growing problem [32]. Fast and uncontrolled spread of fake news can affect society in many ways, including ideological polarization [24] and psychological bias [28]. During the Covid-19 pandemic, we have experienced how problematic the situation can be globally [7, 20]. Despite the ongoing efforts for developing automated fake news detection systems, most of the work is still being done by professional journalists in fact checking organizations all around the world<sup>1</sup>.

Although machine learning is one of the promising techniques for automated fake news detection, one of the obstacles is the amount of accurately labeled training data that is available. Unfortunately, there are very few labeled datasets of sufficient size and quality for supervised learning in this domain. This is due to the scarce resources for manual fact checking and labeling efforts. In addition, because new events are introduced continuously, the content and topic of news articles are time-dependent and varied [4]. Weakly supervised learning, a new machine learning paradigm, has been developed to work with low-quality labels called weak labels [19].

In this paper, we present a fake news detection system that uses weakly supervised learning based only on content features. We have chosen to work with full news articles instead of social media posts or shares. Even though social media is seen as the primary source of the spread of fake news, recent research [27] points out the importance of the coverage of fake news in mainstream media.

Weakly supervised systems can utilize content and contextual features. Previous work using this approach for fake news detection has given promising results [10]. However, the contextual features used in these efforts (e.g. likes, comments, shares) are time-dependent (they change over time), take time to accumulate, and are unavailable for some articles. Therefore, our approach is solely based on the content-based features extracted from the title and content of the articles.

To the best of our knowledge, this is the first work that uses weak supervision for fake news detection by using only content features. Our contributions are three-fold: We introduce a probabilistic weak labeling system that relies only on content features. We collect and present a test dataset from a set of fact-checking organizations including Snopes<sup>2</sup> and PolitiFact<sup>3</sup>. The dataset has been made publicly available on GitHub<sup>4</sup>. We apply five machine learning classifiers for fake news detection with and without the weak labels to investigate the efficiency of using weakly supervised learning with content features.

The rest of the paper is organized as follows: Sect. 2 discusses the state of the art and existing work on weakly supervised methods for fake news detection. Section 3 presents the dataset for the experiments. Section 4 outlines the proposed architecture and experimental design, and presents the findings. Section 5 concludes and gives an outlook to future research directions.

## 2 Related Work

Even though fact checking still relies on professional journalists, the development of automated fake news detection systems has been a focus of researchers for the last couple of years. This research covers a wide variety of approaches. Crowd-sourcing has been proposed to obtain the labels for fake news [6,17]. However, human annotators can process only a limited number of articles: [6] had 90 articles annotated, whereas [17] acquired labels for 240 articles. Crowd-sourcing suffers from high costs and doubts about the annotations' quality. [35] argues that, with a large enough population of fact-checkers, indicating the credibility of articles remains feasible. How to attract and motivate a large enough population remains to be seen.

Research on fully automated methods follows different approaches, such as content-based, user-based, network-based, and hybrid methods that combine the other methods [15]. Content-based methods focus on the analysis of text and non-text content, such as video or sound. For instance, Shrestha et al. [21] combine textual features, sentiment, writing style and psycho-linguistics to identify fake news. User-based methods look at user behaviour and comments to identify fake news [26]. Wang et al. [31] combine a weakly supervised approach with user reports. Network-based methods monitor network activity, which can help to detect bots and investigate the spread patterns. Conversely, Shu et al. [23] report that humans spread more fake news than bots. In this case, finding out whether a user is a real human may help to increase the clues a system collects. Moreover, [23] shows that fake news is more likely to be spread by fake accounts.

The computational methods used in the automated detection of fake news are varied. Castelo et al. [4] propose a topic-agnostic approach to the classification of fake news by using web-markup in addition to LIWC (Linguistic Inquiry and Word Count) and stylistic features. With this approach, they focus on identifying the non-credible web pages spreading fake news instead of detecting individual fake news articles.

<sup>1</sup> https://reporterslab.org/fact-checking/.

<sup>2</sup> www.snopes.com.

<sup>3</sup> www.politifact.com.

<sup>4</sup> https://github.com/piiingz/fake-news-detection-test-set.

Weakly supervised learning (mainly together with contextual features) has been used for fake news detection by many researchers. Helmstetter and Paulheim [10] apply weakly supervised learning to microblogs for detecting fake news in social media and obtain an accuracy of approximately ninety percent. Wang et al. [30] use reinforcement learning for fake news detection with the use of crowd-sourced labels. Yuan et al. [34] combine weakly supervised learning with a structure-aware multi-head attention network to identify fake news. Weakly supervised learning has been used with content features for tasks such as learning discourse structures in dialogues [2] and building a text classifier in combination with transfer learning [25].

## 3 Dataset

To choose the best-suited dataset for this task, we reviewed 14 datasets. Table 1 presents an overview of these datasets. Our evaluation considered four properties: size, features, class balance, and labeling method. As a result, we decided to use the NELA-GT-2019 [9] dataset<sup>5</sup>. The chosen dataset has a large amount of data for all classes and thus supports creating class-balanced subsets. The dataset's features include title and content. The documentation of the dataset is excellent. NELA-GT-2019 comprises 1.12M news articles from 260 mainstream and alternative news sources, collected between 01 January 2019 and 31 December 2019. There are four different labels: *reliable*, *mixed*, *unreliable*, and *unknown*. In this work, we consider articles labelled *reliable* as credible news and *unreliable* as fake news. We discard the labels *mixed* and *unknown*. The labels in the dataset have been assigned based on the credibility of the news sources, which does not guarantee the correctness of the information itself.

<sup>5</sup> At the time this work started, the NELA-GT-2020 dataset was unavailable.
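The label handling described above amounts to a simple mapping and filter, sketched below. The record layout (dictionaries with `title` and `label` keys) and the `target` field are hypothetical and do not reflect the actual NELA-GT-2019 file format.

```python
# Source-level labels kept for training; "mixed" and "unknown" are dropped.
LABEL_MAP = {"reliable": "credible", "unreliable": "fake"}


def to_binary(articles):
    """Keep reliable/unreliable articles and attach a binary target."""
    kept = []
    for article in articles:
        target = LABEL_MAP.get(article["label"])
        if target is not None:
            kept.append({**article, "target": target})
    return kept


articles = [
    {"title": "A", "label": "reliable"},
    {"title": "B", "label": "unreliable"},
    {"title": "C", "label": "mixed"},
    {"title": "D", "label": "unknown"},
]
binary = to_binary(articles)
```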

Dataset Collection. For a more realistic assessment of the developed models, we have collected an independent dataset that consists of manually fact-checked news articles. The labels of the news articles in this test set are therefore based not only on the sources they were published in, but also on the decisions of professional fact checkers. In addition, we paid attention to the publishing date of the articles to avoid testing our models on the same news items that were included in the training dataset. Therefore, the articles collected for this dataset were published in a different period than NELA-GT-2019. During the collection of this dataset, we used entries from the FakeNewsNet<sup>6</sup> [22] and MisInfoText [1] datasets as well as manually collected articles from the Snopes fact-checking archives<sup>7</sup>. The collected dataset includes 434 news articles, half of which are fake and half of which are real. This dataset is available on GitHub<sup>8</sup>.

## 4 Architecture, Experiments and Results

The proposed system consists of two main components: The weak labeling system and the classification models that use weakly supervised learning. For each of these main components we ran a series of experiments in order to find the best performing models. Then we combine these in our proposed architecture. Figure 1 shows the overall architecture of the proposed system.

Fig. 1. Overall system architecture

First, we apply pre-processing and feature engineering to the raw data. Then the output is passed to the weak labeling system, which generates weak labels. After the application of document representation, the weakly labelled data is passed to the end model. We have experimented with *Snorkel* and *Snuba*, two weak labeling frameworks, and five classifiers: Logistic Regression, XGBoost [5], ALBERT [12], XLNET [33], and RoBERTa [13].

<sup>6</sup> https://github.com/KaiDMML/FakeNewsNet.

<sup>7</sup> https://www.snopes.com/fact-check.

<sup>8</sup> https://github.com/piiingz/fake-news-detection-test-set.

Table 1. Fake news datasets reviewed in this work.

In the following sections, all these steps are explained in detail and the results from various experiments are presented.

#### 4.1 Data Pre-processing

The pre-processing steps include applying natural language processing (NLP) techniques such as normalization, stop word removal, and tokenization to the news text. More specifically, we have normalized the text, removed punctuation, digits and stop words, and tokenized the text into words, bigrams, trigrams and sentences. We used the NLTK word tokenizer<sup>9</sup>, NLTK sentence tokenizer<sup>10</sup>, NLTK part-of-speech tagger<sup>11</sup>, WordNetLemmatizer<sup>12</sup> and Python's built-in lowercasing function. Each step has been applied to both the title and the body of the articles.

<sup>9</sup> https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word\_tokenize.

<sup>10</sup> https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent\_tokenize.

<sup>11</sup> https://www.nltk.org/\_modules/nltk/tag.html#pos\_tag.

<sup>12</sup> https://www.nltk.org/\_modules/nltk/stem/wordnet.html.
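The pre-processing steps of this section can be approximated with the standard library alone, as in the sketch below. It stands in for the NLTK tools used in the paper and omits part-of-speech tagging and lemmatization; the tiny stop-word list, the regular-expression sentence splitter, and the sample text are assumptions made for the example.

```python
import re
import string

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that"}

# Translation table that deletes punctuation and digits.
_STRIP = str.maketrans("", "", string.punctuation + string.digits)


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase, drop punctuation and digits, then drop stop words."""
    cleaned = text.lower().translate(_STRIP)
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def ngrams(tokens, n):
    """All contiguous n-grams of the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "The moon landing was staged in 1969! Experts deny that claim."
sentences = split_sentences(text)
tokens = tokenize(text)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

The same functions would be applied separately to the title and to the body of each article.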

#### 4.2 Feature Engineering

After the pre-processing step, we have determined four types of relevant features based on the literature: stylistic features, complexity features, POS-tagging features, and sentiment features [3,11,16,18].

Stylistic features include author's writing style such as the use of exclamation marks and uppercase words; complexity features include implicit features of the text such as type-token ratio and words per sentence; POS-tagging features include all POS-tag related features such as the presence of verbs, nouns, and adjectives; finally, sentiment features include the sentiment scores of the text such as the scores for subjectivity, positiveness, and negativeness.

In total, we have used 68 extracted features such as *"ratio of stop words"*, *"number of quote marks"*, *"ratio of nouns per word"* and *"document negative score based on sentences"*. A complete list of these features can be found in [8]. For each of these features, the resulting numerical values such as the sentence count, word count etc. are then passed as an input to the weak labeling system.
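A few stylistic and complexity features of the kind listed above can be computed as follows. The feature names, the word pattern, and the stop-word list are assumptions for this sketch; the complete list of 68 features is given in [8], and sentiment and POS-tagging features are left out because they need external resources.

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that"}


def extract_features(text):
    """Return a dictionary of numerical content features for one text."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [w.lower() for w in words]
    n_words = max(len(words), 1)  # avoid division by zero on empty text
    return {
        # Stylistic features
        "exclamation_count": text.count("!"),
        "quote_mark_count": text.count('"'),
        "uppercase_word_ratio":
            sum(w.isupper() and len(w) > 1 for w in words) / n_words,
        "stop_word_ratio": sum(w in STOP_WORDS for w in lowered) / n_words,
        # Complexity features
        "type_token_ratio": len(set(lowered)) / n_words,
        "words_per_sentence": len(words) / max(len(sentences), 1),
    }


features = extract_features('SHOCKING news! The mayor said "no" to the plan.')
```

Each article thus becomes a vector of numerical values, which is the form the weak labeling system expects as input.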

#### 4.3 Weak Labeling System

The first main component of our architecture focuses on the generation of weak labels for fake news detection. For this, we consider two weak labeling frameworks: Snorkel<sup>13</sup> and Snuba<sup>14</sup>. We have run a set of experiments to compare these two frameworks in the context of fake news detection. Figure 3 shows the overall pipeline of the experiments to evaluate weakly supervised fake news detection.

During our experiments with Snorkel, in order to enhance the performance, we have developed three components for the weak labeling system: *automatic threshold search*, *automatic labeling function (LF) generation*, and *labeling function (LF) selection*. To create this weak labeling system, we have used a small portion of our labeled data, which is not included in the evaluation of the end models, to prevent data leakage. Figure 2 shows the overall pipeline.

*Automatic threshold search* takes the instances described with descriptive statistics (such as title word count) as input and selects the best feature values (thresholds) for deciding whether an instance is fake or real. The *automatic LF generation* component handles the automatic generation of labeling functions in Snorkel. Labels are assigned automatically based on the thresholds defined in the previous step, by checking whether a feature value of an instance is above or below the threshold. It is also important to find values that cover a large portion of the data set, since the higher the coverage, the more labels are assigned. The *LF selection* component handles potentially very noisy labels by selecting a subset of the LFs. To do that, we evaluated three sets of LFs using Snorkel's generative model and majority vote approaches: all LFs (*All*), LFs with an individual accuracy above 65% (*Acc > 65%*; this value was chosen as a result of separate experiments), and the top 25 LFs based on their accuracy (*Top 25*).

<sup>13</sup> https://www.snorkel.org.

<sup>14</sup> https://github.com/HazyResearch/reef.

As a result of our experiments, we have found that the best performing configuration was *Acc > 65%*, with an accuracy of 0.710 and a coverage of 0.860. More details on these components can be found in [8].
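The three components can be sketched in plain Python as follows. This is a simplified illustration and not the Snorkel API or the implementation of [8]: the names `search_threshold`, `make_lf`, and `select_lfs` are hypothetical, only the rule "value above threshold means fake" is searched, and the four development instances are made up.

```python
FAKE, REAL, ABSTAIN = 1, 0, -1

def search_threshold(values, labels):
    """Return (threshold, accuracy) of the rule 'value > threshold -> FAKE'
    that best separates a small labeled sample."""
    best = (None, 0.0)
    for t in sorted(set(values)):
        preds = [FAKE if v > t else REAL for v in values]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best[1]:
            best = (t, acc)
    return best

def make_lf(feature, threshold):
    """Generate a labeling function; it abstains when the feature is missing."""
    def lf(instance):
        value = instance.get(feature)
        if value is None:
            return ABSTAIN
        return FAKE if value > threshold else REAL
    return lf

def select_lfs(lfs_with_acc, min_acc=0.65):
    """Keep only the LFs whose individual accuracy exceeds min_acc."""
    return [lf for lf, acc in lfs_with_acc if acc > min_acc]

# Made-up development sample: (features, label).
dev = [
    ({"exclamation_marks": 4, "title_word_count": 7}, FAKE),
    ({"exclamation_marks": 3, "title_word_count": 12}, FAKE),
    ({"exclamation_marks": 0, "title_word_count": 12}, REAL),
    ({"exclamation_marks": 1, "title_word_count": 7}, REAL),
]
labels = [y for _, y in dev]
candidates = []
for feature in ("exclamation_marks", "title_word_count"):
    values = [x[feature] for x, _ in dev]
    threshold, acc = search_threshold(values, labels)
    candidates.append((make_lf(feature, threshold), acc))
selected = select_lfs(candidates)
```

In this toy sample, only the LF built on `exclamation_marks` passes the 65% accuracy filter; the LF built on `title_word_count` reaches 50% and is discarded.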

Fig. 2. The pipeline of the automatic weak labeling system in Snorkel. The purple color indicates the components developed in this work. The white color indicates preliminary processing, yellow color indicates the processes handled by Snorkel and the gray color indicates the input and output of the system. (Color figure online)

The Snuba framework was proposed in [29]; it creates heuristics that assign probabilistic labels to instances. Compared to Snorkel, it generates less noisy labels and labels a more diverse set of instances. In this work, we have implemented a weak labeling system using Snuba and tested it with three types of heuristics, namely decision trees, logistic regression, and k-nearest neighbor (KNN). Following the findings of [29], which suggest that a maximum cardinality below four is sufficient for most real-world tasks, we have experimented with values below four. Due to hardware limitations, we could not obtain any results for KNN with a maximum cardinality of three. The results of these experiments are shown in Table 2. Based on these results, we have chosen the best method in terms of accuracy and coverage. Note that the portion of the data set used for these experiments does not contain the data from the weak label generation part, in order to prevent data leakage.

As a result of our experiments with Snorkel and Snuba, we found that Snuba achieves an accuracy of 0.765 and a coverage of 0.902, outperforming Snorkel in terms of both accuracy and coverage. We attribute this to Snuba's heuristics being more complex than Snorkel's and to Snuba taking the diversity of the heuristics into account. Therefore, we use Snuba as our weak labeling component. We then ran the best performing weak labeling system on the *manually labeled test set* to check whether the classifiers perform better than the weak labeling system alone, so that it is reasonable to train end models. We observed that *Snuba, DT, 3* achieved an accuracy of 0.646, an F1 score of 0.668, and a coverage of 0.956.
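For clarity, the two quantities used to compare the weak labeling systems can be computed as in the following sketch. It assumes that an abstaining labeler is encoded as `-1` and that accuracy is measured only over the instances that received a label; both conventions are assumptions of this example.

```python
ABSTAIN = -1

def coverage_and_accuracy(weak_labels, gold_labels):
    """Coverage: share of instances that received a weak label.
    Accuracy: share of correct labels among the labeled instances."""
    labeled = [(w, g) for w, g in zip(weak_labels, gold_labels) if w != ABSTAIN]
    coverage = len(labeled) / len(weak_labels)
    accuracy = sum(w == g for w, g in labeled) / len(labeled) if labeled else 0.0
    return coverage, accuracy

# Five instances, one abstention, one wrong label.
cov, acc = coverage_and_accuracy([1, 0, -1, 1, 0], [1, 0, 1, 0, 0])
```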

Table 2. The results from the experiments with different types of heuristics of Snuba.

#### 4.4 Document Representation

Classifiers require the input to come in the form of numerical vectors. We experiment with two different methods to obtain such vectors from the output of the weak labeling system: TF-IDF and BERT-specific. BERT-based models are designed to deal with raw text, which reduces the processing to two simple steps. First, we merge the title and content of each article. Second, we trim the text to conform to the maximum number of tokens supported by the models. For Logistic Regression and XGBoost, we used TF-IDF with an array size of 6000.
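The two representations can be illustrated with a minimal, dependency-free sketch. The whitespace tokenizer, the default limit of 512 tokens, and the smoothed IDF formula are assumptions of this example; an actual BERT-style pipeline counts subword tokens with the model's own tokenizer, and a library implementation of TF-IDF would normally be used.

```python
import math
from collections import Counter

def merge_and_trim(title: str, content: str, max_tokens: int = 512) -> str:
    """Merge title and content, then keep at most max_tokens whitespace tokens."""
    tokens = f"{title} {content}".split()
    return " ".join(tokens[:max_tokens])

def tfidf_vectors(docs, max_features=6000):
    """Return (vocabulary, vectors): raw term frequency times smoothed IDF."""
    tokenized = [d.lower().split() for d in docs]
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    # Keep the max_features terms with the highest document frequency
    # (ties are broken alphabetically).
    vocab = sorted(sorted(doc_freq), key=lambda t: -doc_freq[t])[:max_features]
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors

text = merge_and_trim("A title", "one two three", max_tokens=3)
vocab, vecs = tfidf_vectors(["fake news spreads fast", "real news"], max_features=3)
```

Capping the vocabulary with `max_features` is what yields fixed-size input vectors, corresponding to the array size of 6000 mentioned above.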

#### 4.5 Weakly Supervised Learning

We have trained five models (Logistic Regression, XGBoost, ALBERT, XLNet, and RoBERTa) to determine the best performing classification model for weakly supervised learning in this domain. We have chosen these models based on their previous success in fake news classification [14]. We have also trained the same models as supervised end models for comparison. Table 3 shows the size of the datasets used in this experiment. As shown in Fig. 3, both the weakly supervised models and the supervised models take a portion of the labeled data as input. The weakly supervised models take the weakly labeled data from the weak labeling system as an additional input.

Table 4 presents the results from our experiments with these models using weak labels. The results show that RoBERTa outperforms the four other classifiers on the manually created test set, reaching an accuracy of 0.753 and an F1 score of 0.779 for the supervised method, and an accuracy of 0.779 and an F1 score of 0.798 for the weakly supervised method. The second best performing model in this setting is XLNet, with an accuracy of 0.719 and an F1 score of 0.742 for the supervised method, and an accuracy of 0.733 and an F1 score of 0.752 for the weakly supervised method. These results show that the weakly supervised method performs slightly better than the supervised approach. They also suggest that the combination of weak labeling system and classifier performs better than the weak labeling system alone (see Sect. 4.3).

Fig. 3. Experimental pipeline for the end models.

In order to understand how the amount of weak labels introduced affects the weakly supervised model, we have experimented with three different ratios of weak labels. Based on the result of the previous experiment, we have used RoBERTa for both the weakly supervised and the supervised models. First, we have trained our models with all the weakly labeled instances (approx. 170K), and then with 50K and 25K weakly labeled instances, respectively, where the total number of instances in the dataset for this set of experiments is approximately 201K. Table 5 shows the results from these experiments. They indicate that the supervised model performs better than the weakly supervised method, and that the performance decreases as we keep adding more weakly labeled data. The weakly labeled instances are selected by confidence. This suggests that high-confidence labels contribute most to the detection, whereas low-confidence labels degrade the performance. However, the results also show that the difference between these models (especially the supervised, weak 25K, and weak 50K models) is marginal. Given that we have tested our system with only one test set, we do not know how the results would change for other datasets. Additionally, our test set is relatively small compared to the training set (see Table 3). We expect weakly supervised models to perform better in conditions where the test set is similar in size to, or larger than, the training data set. We believe that weakly supervised learning for fake news detection is a promising method that should be explored further. More research is also required to verify the effect of weakly labeled data on fake news detection.
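The confidence-based selection of weak labels can be sketched as follows. The function name `build_training_set`, the tuple layout, and the toy instances are assumptions of this example; in the experiments above, `k` would correspond to 25K, 50K, or all (approx. 170K) weakly labeled instances.

```python
def build_training_set(labeled, weak, k):
    """Combine manually labeled data with the k most confident weak labels.

    labeled: list of (text, label)
    weak:    list of (text, label, confidence) from the weak labeling system
    """
    top_k = sorted(weak, key=lambda item: item[2], reverse=True)[:k]
    return labeled + [(text, label) for text, label, _ in top_k]

labeled = [("article a", 1), ("article b", 0)]
weak = [("article c", 1, 0.95), ("article d", 0, 0.55), ("article e", 1, 0.80)]
train = build_training_set(labeled, weak, k=2)
```

With `k=2`, the low-confidence instance `"article d"` is left out, which mirrors the observation that low-confidence labels tend to hurt the end model.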


Table 3. Size of datasets used.

Table 4. Comparison of classifiers. For each of the five classifiers, we list the scores on the *manually created test set*, as well as the difference resulting from the use of weak labels. The rows refer to Logistic Regression (LR), XGBoost (XG), ALBERT (AL), XLNet (XL), and RoBERTa (Ro).


Table 5. The comparison of supervised and weakly supervised models with different ratios of weak labels.


## 5 Conclusions and Future Work

Automation will remain necessary to combat fake news as long as fact-checkers remain a scarce resource. Fake news classifiers rely on accurate labels. This work proposed and explored the use of weakly supervised learning that relies only on content features. Our observations on the performance of different weak labeling frameworks suggest that Snuba performs better than Snorkel for this task. In our experiments with five different classifiers, RoBERTa outperformed the other four classifiers in both the supervised and the weakly supervised tasks. We tested the utility of the weak labels for fake news detection with the help of the NELA-GT-2019 data set and a manually created test set, which has been made publicly available. We observed that the more weak labels we introduced, the more the classification performance dropped. However, this decrease is not significant. Therefore, weakly supervised learning may be a suitable method to use in the absence of labeled data. More research is necessary to investigate successful ways to blend in weak labels without compromising performance.

As future work, we intend to use additional data sets to verify our findings. Further, we will explore how to effectively use the confidence score to estimate the effect of a weak label.

## References



# Improving the Usability of Tabular Data Through Data Annotation, Repair and Augmentation

Rabeb Abida and Anthony Cleve

PReCISE, NaDI, Faculty of Computer Science, University of Namur, Namur, Belgium {rabeb.abida,anthony.cleve}@unamur.be

Abstract. In recent years, a rapidly increasing amount of information has been made publicly available in tabular form on the Web. Many of these data are not usable due to their poor quality (e.g., misspelled or missing values, missing or incomplete metadata, and missing meaningful columns). Solutions have been proposed in the literature to address these data quality issues, but there is still a lack of all-in-one approaches that can fully solve them. Therefore, users need to combine several methods to solve these data quality issues. In this paper, we present an all-in-one and automatic approach called SINATRA that helps to bridge these gaps by providing the following features: *data annotation* (to address misspelled and incomplete metadata issues), *data repair* (to address missing cell values), and *data augmentation* (to dynamically add meaningful columns and corresponding cell values to the dataset). An evaluation of the SINATRA approach based on datasets from a state-of-the-art benchmark shows promising results in terms of F1-measure and precision.

Keywords: Usability · Tabular data · Data annotation · Data repairing · Data augmentation

## 1 Introduction

Nowadays, a vast amount of information is provided on the Web as unstructured text, semi-structured data, and more structured data in the form of tables [2,4,10,12]. Such tables can be difficult to use due to data quality issues, such as misspellings and missing metadata, ambiguity in table cells, missing cell values, and missing significant columns [4,6–8,10,12].

Several methods have been proposed in the literature to solve the aforementioned issues. On the one hand, *Semantic Table Annotation* (STA), also known as *data annotation*, consists of assigning semantic tags from knowledge graphs (KGs) (e.g., Wikidata [15] and DBpedia [3]) to the elements of the data columns. *Data annotation* has proven to effectively solve the problem of spelling errors and missing or incomplete metadata [8–10,12,13]. On the other hand, *data repair* handles the problem of missing cell values, and *data augmentation* adds meaningful columns and corresponding cell values to the data. As part of the "Tabular Data to Knowledge Graph Matching" competition [9], some approaches have implemented the STA process, such as [8,10,12], but they have not incorporated *data repair* and *data augmentation* phases. Meanwhile, other works such as OpenRefine<sup>1</sup> and Magic [13] propose systems that are capable of both annotating and augmenting a dataset, but they do not support any *data repair* phase.

Despite the systems proposed in the literature to solve these data quality issues, there is still no all-in-one approach that can handle all of them, nor are there additional features that further support the STA process. Therefore, users need to combine multiple methods to tackle these problems.

In this paper, we present an all-in-one and fully automatic proposal called SINATRA (SemantIc aNnotation AugmentaTion and RepAir) that helps fill these gaps by providing the following features:

(i) *data annotation* is used to resolve spelling errors and missing or incomplete metadata. It is based on the STA process, which consists of three main tasks: Column Type Annotation (CTA) (Fig. 1c), Column Property Annotation (CPA) (Fig. 1a), and Column Entity Annotation (CEA) (Fig. 1b). These tasks map the data elements to concepts in the knowledge graph (DBpedia KG), as shown in Fig. 1. To illustrate each task of the STA process [12], Fig. 1 considers a table from a real dataset<sup>2</sup>, which lists the names of presidents (col1) and their places of birth (col2).

Fig. 1. Data annotation. Tabular data (black) is annotated with the properties (magenta), entities (blue), and types (green) from DBpedia as asked in the CPA (a), CEA (b), and CTA (c) tasks respectively. (Color figure online)

(ii) *data repair* is used to handle missing or incomplete cell values in the dataset. It is based on a method that applies SPARQL queries to fetch missing cell values from the DBpedia KG. Figure 2 shows an example of the data repair phase by adding a cell value "http://dbpedia.org/resource/Honolulu".

<sup>1</sup> https://openrefine.org/.

<sup>2</sup> https://tinyurl.com/4hrx6s48.

(iii) *data augmentation* is used to dynamically add meaningful columns and their corresponding cell values to the dataset. It is based on a method that applies (i) SPARQL queries to fetch the property URIs (CPA) of the new columns proposed by users and (ii) SPARQL queries to fill in the corresponding cell values of the newly added columns. Figure 2 shows an example of the data augmentation feature, in which a new column "http://dbpedia.org/ontology/birthDate" is added.

Fig. 2. Example of *data repair* by adding cell value "http://dbpedia.org/resource/Honolulu" (light green) and *data augmentation* by adding new column "http://dbpedia.org/ontology/birthDate" (light blue). (Color figure online)

To evaluate our approach, we used some of the datasets proposed by the "Tabular Data to Knowledge Graph Matching" competition [9,10]. We measure the effectiveness of the SINATRA approach with the F1-measure and precision metrics and demonstrate the capability of its features.

The remainder of the paper is organized as follows. Section 2 positions our work with respect to related literature. Section 3 gives an overview of our approach, describes in detail the different phases it covers, and presents its implementation. Section 4 evaluates SINATRA and assesses the effectiveness of its phases. Section 5 concludes this paper and anticipates future research directions.

## 2 Related Work

This section reviews related work on popular approaches and tools that address gaps in data quality issues (e.g., misspelled or missing values, missing or incomplete metadata, and missing meaningful columns). We present them with their respective features, strengths and weaknesses.

Some works focus on individual, non-integrated steps, such as data pre-processing or subject column (Sub\_Col) detection [13]. Furthermore, OpenRefine and [11,14] rely only on their own data (domain-independent) and perform only a few steps of the STA process. They can be classified as supervised (Sup: they exploit already annotated tables for training) and semi-automatic. Other works [8,10,12,13] can be classified as unsupervised (Unsp: they do not require training data) and automatic. They do not provide a user-friendly graphical interface, and manually annotating the data is time-consuming for the user.

The STA process [10] is composed of five steps: (i) data pre-processing, which aims to prepare the data inside the table; (ii) Sub\_Col detection, which identifies the main column of the table; and (iii) the three sub-steps of *data annotation*, namely the CEA task (Fig. 1b), the CTA task (Fig. 1c), and the CPA task (Fig. 1a). Other proposals have been made to close the gaps in the above-mentioned approaches and perform all the steps of the STA process. In this way, [8,10,12] propose novel techniques that provide high-quality annotations to address the issues of misspellings and missing or incomplete metadata. They use unsupervised learning techniques, which can be applied to general-purpose domains, and rely on an open-source KG that is freely available on the Web (DBpedia). MantisTable [8] addresses the limitation of the Sub\_Col task: it allows users to apply a series of steps to prepare the data and uses different features to automatically assign the Sub\_Col. MTab [12] is an automatic semantic annotation system that jointly deals with the three tasks CTA, CEA, and CPA. It is based on the joint probability distribution of multiple tables to DBpedia KG matching. MTab achieved impressive empirical performance on the three annotation tasks of the STA process and won the first prize at the SemTab challenge [9,10]. MTab does not offer Sub\_Col detection but achieves excellent results, whereas MantisTable does not match the results of MTab but supports Sub\_Col detection [9,10]. These systems [8,10–12,14] cannot create or add new columns to *augment* the annotated data with additional knowledge graph (KG) information.

In contrast, OpenRefine and Magic [13] offer systems capable of both annotating and augmenting a dataset. OpenRefine can perform a semi-automatic reconciliation process against any database that exposes a Web service following the Reconciliation Service API<sup>3</sup> specification or a SPARQL endpoint. This tool requires the user to manually correct a cell that matches multiple entities (CEA). In addition, it is able to create new columns through facets, where the user has to formulate the URL to fetch the URIs. Magic [13] is a system capable of annotating a dataset using an interpretable embedding technique and KGs (DBpedia, Wikidata). It can add a column to further *augment* the tabular data. It does not perform the data pre-processing phase itself and relies on techniques already proposed by state-of-the-art approaches for that particular phase. Magic might not outperform the existing state-of-the-art techniques for generating such annotations [1]. Despite all their achievements and results, these tools cannot solve the problem of missing cell values, since they do not include a *data repair* phase.

In addition, in the R&D community, there is a lack of automated support [2,5], which can combine the appropriate features defined in Table 1 to assist users in overcoming data quality issues.

<sup>3</sup> https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API.

Table 1 summarizes the selected approaches and tools that meet certain features: *Data annotation*, *Data repair* and *Data augmentation*, and shows the difference between them and our proposed approach SINATRA.

Table 1. Approaches and tools that support the above features: *Data annotation*, *Data repair* and *Data augmentation*.


SINATRA is a solution designed as an all-in-one and automatic approach based on MantisTable [8] and MTab [12] systems, which will be described in Sect. 3.

## 3 The SINATRA Approach

This section describes a fully automatic approach, which combines all methods and tools into one integrated approach.

Fig. 3. An overview of SINATRA approach (tool).

This proposal addresses the difficulties associated with data quality on the Web, especially for tabular data. More details on the implementation of the approach are available online<sup>4</sup>. It implements its features (*data annotation*, *data repair*, and *data augmentation*) through the following four phases: data pre-processing and subject column (Sub\_Col) detection, data annotation, data repair, and data augmentation. Figure 3 presents an overview of the proposal.

1. During the data pre-processing and Sub\_Col detection phase, the SINATRA approach takes as input a possibly large number of local Excel or CSV datasets from the user's computer, so that the datasets are automatically prepared and the Sub\_Col is detected before the data annotation phase is applied. This phase is based on the MantisTable approach [8] and consists of two steps. (i) The data pre-processing step cleans and harmonizes the data inside the table: it removes HTML tags, stop words, and some characters (i.e., " and '), turns text into lowercase, deletes text in brackets, and normalizes units of measurement. (ii) Once this step is complete, the system detects the Sub\_Col, which acts as the Subject of the relationships among columns, while the other columns are annotated as Objects (the Sub\_Col is shown in orange in Fig. 1). This step starts by determining the literal columns (e.g., address, phone number, URL, color) using regular expressions. The system then chooses the Sub\_Col from the remaining columns (called Named Entity columns) based on different statistical features, such as the average number of words in each cell, the fraction of empty cells in the column, the fraction of cells with unique content, and the distance from the first Named Entity column [8]. More details on these steps can be found in [8]. Once this phase has finished, the system moves on to the second phase, which consists of annotating the dataset.

<sup>4</sup> https://github.com/123rabida123/SINATRA-Annotation-Repair-Augmentation.

2. The data annotation phase aims to automatically annotate tabular data elements with the DBpedia KG (Fig. 1). This phase relies on the MTab approach [12] to perform the three tasks: Column Entity Annotation (CEA), which maps table cells (values) to entities in DBpedia (Fig. 1b); Column Property Annotation (CPA), which maps column pairs to an ontology property (Fig. 1a); and Column Type Annotation (CTA), which maps table columns to an ontology class (Fig. 1c). The mapping process in MTab is based on the joint probability distribution of multiple tables to KG matching. It improves the matching by using multiple services, including DBpedia Lookup, the DBpedia endpoint, and Wikidata lookup, as well as a cross-lingual matching strategy. This mapping is done in six steps. (i) The first step estimates the most likely candidate entities (CEA) found by those different search services. (ii) The second step infers the most likely classes (CTA). It distinguishes the entity columns from the numerical columns: if the vote returns a text or integer tag, then the column is of type entity; otherwise, it is numeric [16]. (iii) The third step establishes the relationships between the different columns (CPA) using the DBpedia endpoint. (v) Step five selects the candidates (CEA) with the highest probabilities in step four to establish their relationship (CPA) via a majority vote. (vi) Step six selects the candidates (CEA) with the highest probabilities in step four to establish their type (CTA) via a majority vote. More details about each step of MTab can be found in [12]. Our contribution in the first two phases is that we combine the strengths of MantisTable and MTab to perform both phases.
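The majority vote used to derive a column type from cell-level candidates can be illustrated with the following sketch. It is a simplified illustration and not MTab's implementation [12]: the function name `vote_column_type`, the rule that each cell casts one vote per candidate type, the alphabetical tie-break, and the type names in the example are all assumptions.

```python
from collections import Counter

def vote_column_type(cell_candidate_types):
    """Pick the column type supported by the most cells.

    cell_candidate_types: one list of candidate type names per cell; each
    cell contributes at most one vote per type.
    """
    votes = Counter(t for types in cell_candidate_types for t in set(types))
    if not votes:
        return None  # no candidates at all: the column stays untyped
    # Iterate in alphabetical order so that ties are broken deterministically.
    return max(sorted(votes), key=lambda t: votes[t])

column = [["Person", "President"], ["Person"], ["Person", "Politician"]]
column_type = vote_column_type(column)
```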

MTab does not offer a Sub\_Col detection phase but achieves excellent results in annotating data and resolving misspelling issues, whereas MantisTable does not match the results of MTab but supports Sub\_Col detection.
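The pre-processing and Sub\_Col detection steps of the first phase can be sketched as follows. This is only an illustration under stated assumptions: the scoring formula in `subject_column_score` is a toy combination of the statistical features named above and is not the MantisTable formula [8], the exclusion of literal columns via regular expressions is omitted, and the table rows are made up.

```python
import re

def clean_cell(value: str) -> str:
    """Strip HTML tags, text in brackets, and quote characters; lowercase."""
    value = re.sub(r"<[^>]+>", " ", value)               # HTML tags
    value = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", value)  # text in brackets
    value = re.sub(r"[\"']", "", value)                  # quote characters
    return re.sub(r"\s+", " ", value).strip().lower()

def subject_column_score(cells, position):
    """Toy score: favour columns with unique, non-empty, multi-word cells
    that appear early in the table."""
    non_empty = [c for c in cells if c]
    empty_fraction = 1 - len(non_empty) / len(cells)
    unique_fraction = len(set(non_empty)) / len(cells)
    avg_words = sum(len(c.split()) for c in non_empty) / max(len(non_empty), 1)
    return (unique_fraction + avg_words - empty_fraction) / (1 + position)

def detect_subject_column(columns):
    """columns: dict mapping column name -> list of cleaned cells, in table order."""
    scores = {name: subject_column_score(cells, i)
              for i, (name, cells) in enumerate(columns.items())}
    return max(scores, key=scores.get)

table = {
    "col1": [clean_cell("<b>Barack Obama</b> (44th)"), "president b", "president c"],
    "col2": [clean_cell('"Honolulu" [1]'), "", "city c"],
}
sub_col = detect_subject_column(table)
```

In this toy table, `col1` is selected because its cells are all non-empty, unique, and multi-word, and because it is the first column.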

Once the data annotation phase completes, we get an annotated dataset, but some cells in this dataset still have null values "nan" (Fig. 4a). Hence, we can observe the MTab system's shortcoming, which cannot add the missing cell values in the datasets, as shown in the example in the screenshots (Fig. 4a).

3. The data repair phase aims to automatically fill in missing cell entities (values), i.e., the undefined values "nan". Our algorithm applies SPARQL queries that take the cell entity (CEA) of the Sub\_Col and the column property (CPA) (i.e., CEA + CPA) to retrieve the missing cell entities (CEA). Listing 1.1 gives an example of a SPARQL query that retrieves the missing cell entity of the first row of the dataset in Fig. 2.

In some cases, the query returns ambiguous entities. In this case, our algorithm calculates the pre-score of each entity using the *confidence-score* (CFS) of the Sub\_Col entity and the cell entity, and determines the relationship. If there is a relation (CPA) between them (Sub\_Col entity and Cell entity), the CFS increases by 1. For example, CFS (honolulu) = 1, CFS (Honolulu) = 1 and there is a relation between "barack\_Obama" and "Honolulu", hence CFS = 2. The SPARQL query (Listing 1.1) retrieves an object for the content of the column "http://dbpedia.org/ontology/birthPlace" (Property/Predicate) and the subject of the first row "http://dbpedia.org/resource/barack\_Obama" from DBpedia KG, where the cell entity (object) retrieved by the query (Listing 1.1) is "http://dbpedia.org/resource/Honolulu" (Fig. 4b).

```
PREFIX dbr: <http://dbpedia.org/resource/>
# The subject (Sub_Col entity) and the predicate (CPA) are fixed;
# the object is the missing cell entity.
SELECT ?object
WHERE {
  <http://dbpedia.org/resource/barack_Obama>
  <http://dbpedia.org/ontology/birthPlace>
  ?object
}
```
Listing 1.1. SPARQL query to retrieve a cell entity (Object).
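The repair step can be sketched in Python as follows. This is one possible reading of the procedure described above, not the SINATRA implementation: the names `build_repair_query`, `confidence_score`, and `repair_cell` are hypothetical, the endpoint access is abstracted behind a caller-supplied `run_query` function (here replaced by a stub, so no network access is needed), and the base scores are taken from the example in the text.

```python
DBO = "http://dbpedia.org/ontology/"
DBR = "http://dbpedia.org/resource/"

def build_repair_query(subject_uri: str, property_uri: str) -> str:
    """SPARQL query asking for the object of (subject, property, ?object)."""
    return f"SELECT ?object WHERE {{ <{subject_uri}> <{property_uri}> ?object }}"

def confidence_score(base_scores, candidate, related):
    """Base score, plus 1 if the candidate is linked to the Sub_Col entity."""
    return base_scores.get(candidate, 0) + (1 if candidate in related else 0)

def repair_cell(subject_uri, property_uri, run_query, base_scores):
    """Fill one missing cell. run_query sends a query to a SPARQL endpoint
    and returns a list of object URIs."""
    related = run_query(build_repair_query(subject_uri, property_uri))
    candidates = set(base_scores) | set(related)
    if not candidates:
        return None  # nothing found: the cell stays empty
    return max(sorted(candidates),
               key=lambda c: confidence_score(base_scores, c, related))

def fake_endpoint(query):
    """Stand-in for a real SPARQL endpoint (assumption for this example)."""
    return [DBR + "Honolulu"]

base = {DBR + "honolulu": 1, DBR + "Honolulu": 1}
value = repair_cell(DBR + "barack_Obama", DBO + "birthPlace", fake_endpoint, base)
```

Both candidates start with a score of 1; only "Honolulu" is related to the subject through the birth place property, so its score rises to 2 and it is chosen as the cell value.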

4. During the data augmentation phase, the system allows the user to add relevant columns to the annotated dataset (Fig. 2). The user simply enters a keyword "new-Column" (Listing 1.2) and chooses a CPA (the URI of the new column) from the list of proposals generated by the approach. For the same keyword (e.g., new-Column = "birth"), several URIs (CPA) can appear in this list, such as "http://dbpedia.org/ontology/birthDate" and "http://dbpedia.org/ontology/birthDeath". The user chooses one CPA, and SINATRA adds it as a new column to the dataset; alternatively, the user can enter the exact name of the column, such as "birthDate". The system then adds the chosen CPA "http://dbpedia.org/ontology/birthDate" if it is not already present in the annotated dataset (Fig. 4c). The algorithm builds the list of CPA proposals as follows: each time the query (Listing 1.2) returns a CPA (a predicate of type rdf:Property) that contains the keyword proposed by the user, the CPA is stored in this list.

```
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?predicate
WHERE {
  ?predicate a rdf:Property
  # Keep only properties from the DBpedia ontology namespace.
  FILTER (REGEX(STR(?predicate), "http://dbpedia.org/ontology/", "i"))
  # new-Column is a placeholder for the keyword entered by the user (e.g., birth).
  FILTER (REGEX(STR(?predicate), "new-Column", "i"))
}
ORDER BY ?predicate
```
Listing 1.2. Generic query to detect predicates from a SPARQL endpoint to add column.

Once the user chooses a CPA, the system creates a new empty column and then applies the same SPARQL queries (Listing 1.1) as in the *data repair* phase to fill in the corresponding cell entities of the newly added column.
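The augmentation step can be sketched as follows. The names `build_predicate_query` and `augment`, the row layout, and the stub `lookup` function are assumptions of this example; in practice, `lookup` would issue the repair query of Listing 1.1 against a SPARQL endpoint.

```python
def build_predicate_query(keyword: str) -> str:
    """SPARQL query listing DBpedia ontology properties whose URI contains keyword."""
    # Escape backslashes and double quotes so the keyword is a valid SPARQL string.
    safe = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "SELECT ?predicate WHERE {\n"
        "  ?predicate a rdf:Property\n"
        '  FILTER (REGEX(STR(?predicate), "http://dbpedia.org/ontology/", "i"))\n'
        f'  FILTER (REGEX(STR(?predicate), "{safe}", "i"))\n'
        "}\nORDER BY ?predicate"
    )

def augment(rows, subject_key, property_uri, lookup):
    """Add property_uri as a new column unless a row already has it.
    lookup(subject, property) returns the cell value or None."""
    for row in rows:
        if property_uri not in row:
            row[property_uri] = lookup(row[subject_key], property_uri)
    return rows

BIRTH_DATE = "http://dbpedia.org/ontology/birthDate"
OBAMA = "http://dbpedia.org/resource/barack_Obama"

def fake_lookup(subject, prop):
    """Stand-in for the repair query of the data repair phase."""
    return {OBAMA: "1961-08-04"}.get(subject)

query = build_predicate_query("birth")
rows = augment([{"subject": OBAMA}], "subject", BIRTH_DATE, fake_lookup)
```

Because `augment` skips rows that already contain the property, applying it twice with the same CPA leaves the dataset unchanged.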

Fig. 4. Screenshots of the *data annotation* (a), *data repair* (b), and *data augmentation* (c) features of SINATRA.

Depending on the user's request, the data augmentation phase can create more than one column, as illustrated in step 5 of Fig. 3. When the system has finished the previous phases and there are still datasets to annotate, it restarts from the first phase and executes the same phases of the SINATRA process (Fig. 3). SINATRA saves the annotated datasets in a local folder, and they can be exported in Excel (XLSX) and CSV format.

Figure 4 depicts the graphical interface of SINATRA and focuses on the data annotation (a), data repair (b), and data augmentation (c) features. We chose the Python library *Tkinter*<sup>5</sup> to develop the graphical interface. Visually, Tkinter is less polished than alternative toolkits; however, the update frequency of a toolkit's source code should be checked before choosing one, and the license of Tkinter is more flexible. The source code of the SINATRA implementation is available on GitHub<sup>6</sup> for future research.

## 4 Evaluation and Demonstration

This section describes the benchmark datasets, ground truths, and evaluation metrics in Sect. 4.1, followed by the evaluation results and a demonstration in Sect. 4.2. The evaluation aims to measure the performance of the *data repair* and *data augmentation* features of the SINATRA approach.

#### 4.1 Datasets, Ground Truths and Measures

We evaluate this proposal using randomly selected datasets<sup>7</sup> and the ground truths proposed by the SemTab competition [9,10]. These ground truths are composed of three targets (CEA-targets, CPA-targets, and CTA-targets)<sup>8</sup> matching the DBpedia KG, one for each annotation task (CEA, CPA, and CTA).

In Table 2, we present the datasets used in our evaluation: Reference of the Dataset, Dataset, #Col, #Rows, and Names of columns.


Table 2. Characteristics of the datasets used to evaluate the SINATRA approach.

<sup>5</sup> https://docs.python.org/fr/3/library/tkinter.html.

<sup>6</sup> https://github.com/123rabida123/SINATRA-Annotation-Repair-Augmentation.

<sup>7</sup> https://zenodo.org/record/3518539#.YoOgK6hBwuU.

<sup>8</sup> https://www.aicrowd.com/challenges/semtab-2020.

To measure the efficiency of the *data repair* and *data augmentation* features of the SINATRA process, we used the following metrics proposed in [9,10]: Precision (P), Recall (R), and F-measure(F1).

Precision (P), recall (R), and F-measure (F1) of the mapping between the datasets and the DBpedia KG are calculated using the following formulas:

$$P = \frac{\#\,\text{perfect annotations}}{\#\,\text{submitted annotations}} \tag{1}$$

$$R = \frac{\#\,\text{perfect annotations}}{\#\,\text{ground truth annotations}} \tag{2}$$

$$F1 = \frac{2 \cdot P \cdot R}{P + R} \tag{3}$$

Here, a *perfect annotation* is an annotation returned by our approach that matches the corresponding ground truth annotation, a *submitted annotation* is any annotation returned by our approach, and the *ground truth annotations* are the annotations in the target tables. F1 is the harmonic mean of P and R.
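The three measures can be computed as in the following minimal sketch. The function name `evaluate` and the representation of annotations as dictionaries from a target identifier (e.g., a cell position) to a URI are assumptions of this example; the URIs are placeholders.

```python
def evaluate(submitted: dict, ground_truth: dict):
    """Return (precision, recall, F1) for one annotation task.

    submitted / ground_truth map a target (e.g., a cell id) to an annotation URI.
    """
    perfect = sum(1 for target, uri in submitted.items()
                  if ground_truth.get(target) == uri)
    precision = perfect / len(submitted) if submitted else 0.0
    recall = perfect / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Four ground truth targets; one cell was removed, so only three are submitted.
truth = {"r1": "uri:a", "r2": "uri:b", "r3": "uri:c", "r4": "uri:d"}
submitted = {"r1": "uri:a", "r2": "uri:b", "r3": "uri:c"}
p, r, f1 = evaluate(submitted, truth)
```

In this example, all submitted annotations are correct, so precision stays at 1, while the missing annotation lowers recall to 0.75 and F1 to about 0.857. This is the effect observed when values are removed from the datasets before evaluation.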

#### 4.2 Evaluation Results and Demonstration

This section evaluates and demonstrates the performance of the features of the SINATRA approach. More details on the results of the evaluation are available in our GitHub repository<sup>9</sup>.

Regarding the evaluation of the *data annotation* feature, this phase of SINATRA is based on the MTab approach. Therefore, it automatically has the same performance as MTab. Table 3 below shows the results of the evaluation of the *data annotation* phase by the MTab approach [12].


Table 3. Evaluation results of the *data annotation* feature by MTab approach.

Our goal in this evaluation is to compare the results of the *data repair* and *data augmentation* phases (Table 4) with the results of the *data annotation* phase (Table 3) to show that they can correctly add the data (entities) and the missing columns.

Regarding the evaluation of the *data repair* feature, we re-used the same datasets as above (Table 2). In this phase, the evaluation is based on two factors. For factor (1), we removed some values from those datasets (Table 2) and calculated the performance of this phase. For factor (2), we added the missing cell values back into these datasets during the *data repair* phase. Table 4 below shows the performance of the *data repair* phase for both factors. For factor (1), the results of the CEA task are reduced because R is reduced (the removed URIs (entities) are in the CEA targets). For factor (2), this phase restores the missing data well: the CEA task reaches *F*1 = 1 for datasets D1 and D2, the same result as the *data annotation* feature. The CEA results are highlighted in yellow in Table 4. For datasets D5 and D6, the results of the CEA task decreased slightly (from *F*1 = 1 in data annotation to *F*1 = 0.987 in data repair), because some URIs were not perfect or were not available in the CEA targets. The CPA results (magenta) and the CTA results (cyan) do not vary between the two factors; they are the same as those of the *data annotation* feature in Table 3.

<sup>9</sup> https://github.com/123rabida123/Datasets-and-Results-of-evaluation-SDA.


Table 4. Evaluation results of the *data repair* and *data augmentation* features.

Regarding the evaluation of the *data augmentation* feature, we re-used the same datasets as above (Table 2). The evaluation is again based on two factors. For factor (1), we removed every second column from those datasets (Table 2) and calculated the performance of this phase (without the second columns). For factor (2), we added the missing columns back into these datasets. Table 4 above shows the performance of this phase for both factors, that is, whether our proposal is able to add exactly the deleted column in each dataset. For factor (1), the results of the CEA, CPA, and CTA tasks are reduced further because R is reduced (the removed URIs (entities) are in the targets). For factor (2), this feature restores the missing column very well: for datasets D1, D2, and D5, the CEA, CPA, and CTA tasks have the same results as the *data annotation* feature in Table 3 (the CEA results are shown in yellow). The CPA results (magenta) and the CTA results (cyan) of datasets D1, D2, D3, D4, and D5 also match those of the *data annotation* feature. Thus, the *data augmentation* feature is able to add missing columns to the datasets. For datasets D3, D4, and D6, the results of the CEA task were slightly reduced, because some URIs were not perfect or were not available in the CEA targets.

#### 5 Conclusion and Future Work

In this paper, we presented an all-in-one, automatic approach called SINATRA that seeks to improve the usability of tabular data through three features. *Data annotation* (relying on the existing tool MTab [12]) maps tabular data elements to concepts in the DBpedia KG to address misspellings and missing or incomplete metadata. *Data repair* handles missing cell values in the tabular data by fetching the corresponding concepts from DBpedia. *Data augmentation* allows the user to dynamically add relevant columns and the corresponding cell values to the data. The evaluation results show that the SINATRA approach was able to annotate, repair, and augment the structured data.

In the near future, we plan to compare our proposal with other existing methods and tools and to extend it with additional features, such as (1) integrating additional knowledge graphs such as WikiData, LOV, Geonames and YAGO to improve the annotation, (2) evaluating the performance of our approach on other open datasets, (3) generating an RDF file of the annotated dataset to publish as Linked Open Data, and (4) providing a visualization graph to enhance the understanding of the relatedness between the concepts of the RDF file.

Acknowledgements. Rabeb Abida is funded by a CERUNA grant from the University of Namur, Belgium. Anthony Cleve is a professor in information system evolution at the University of Namur, Belgium, where he heads the data-intensive system evolution lab. He is currently a visiting professor at Università della Svizzera italiana, Switzerland. Anthony is a member and former president of the PReCISE research center, and a member of the Namur Digital Institute (NADI). He co-edited the book "Evolving Software Systems", published by Springer in 2014.

#### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **AI in Biological Applications and Medicine**

# **Detecting Human Embryo Cleavage Stages Using YOLO V5 Object Detection Algorithm**

Akriti Sharma<sup>1(B)</sup>, Mette H. Stensen<sup>5</sup>, Erwan Delbarre<sup>2</sup>, Momin Siddiqui<sup>4</sup>, Trine B. Haugen<sup>2</sup>, Michael A. Riegler<sup>3</sup>, and Hugo L. Hammer<sup>1,3</sup>

<sup>1</sup> Department of Computer Science, Faculty of Technology, Art and Design, Oslo Metropolitan University, Oslo, Norway

akritish@oslomet.no

<sup>2</sup> Department of Life Sciences and Health, Faculty of Health Sciences, Oslo Metropolitan University, Oslo, Norway

<sup>3</sup> Department of Holistic Systems, Simula Metropolitan Center for Digital Engineering, Oslo, Norway

<sup>4</sup> Department of Computer Science, Jamia Millia Islamia, New Delhi, India

<sup>5</sup> Fertilitetssenteret, Embryology, Oslo, Norway

**Abstract.** Assisted reproductive technology (ART) refers to treatments of infertility which include the handling of eggs, sperm and embryos. The success of ART procedures depends on several factors, including the quality of the embryo transferred to the woman. The assessment of embryos is mostly based on the morphokinetic parameters of their development, which include the number of cells at a given time point, indicating the cell stage, and the duration of each cell stage. In many clinics, time-lapse imaging systems are used for continuous visual inspection of embryo development. However, the analysis of time-lapse data still requires embryologists to evaluate the morphokinetic parameters and cleavage patterns, making the assessment subjective. Recently, the application of object detection in the field of medical imaging has enabled the accurate detection of lesions or objects of interest. Motivated by this research direction, we proposed a methodology to detect and track cells present inside embryos in time-lapse image series. The methodology employed an object detection technique called YOLO v5 and annotated the start of observed cell stages based on the cell count. Our approach could identify cell division to detect cell cleavage, or the start of the next cell stage, accurately up to the 5-cell stage. The methodology also highlighted instances of embryo development with abnormal cell cleavage patterns. On average, the methodology took 8 s to annotate a video (at about 20 frames per second), which will not pose any delay for the embryologists while assessing embryo quality. The results were validated by embryologists, who considered the methodology a useful tool for their clinical practice.

**Keywords:** Track cell division *·* Detect cell in human embryo *·* Detect cell cleavage stage *·* Object detection

#### **1 Introduction**

In Assisted Reproductive Technology (ART) procedures, eggs are fertilized outside the body. The fertilized eggs, called embryos, are cultivated in a controlled environment before being transferred to the woman. The selection of an embryo for transfer is based on the embryologist's evaluation of its quality. Embryos are typically assessed using morphological features, such as the cell count specific to a cell stage or the size of the cells, and the duration of the different cell stages [5]. The morphokinetic parameters include the timing of the successive embryonic cell divisions leading chronologically to the 2-cell stage (two cells), 3-cell stage (three cells), 4-cell stage (four cells), 5-cell stage, 6-cell stage, 7-cell stage, 8-cell stage, 9+-cell stage and finally the morula, a compacted structure made of 8–16 small cells, followed by the blastocyst, which is made up of about a hundred cells. The cell stages of embryo development are shown in Fig. 1. The duration of the different cell stages has proved to be significant in evaluating embryo quality [18]. A simple way of calculating the duration is to count the number of cells and track cell division, which requires continuous monitoring of the developing embryo. The time-lapse technology (TLT) systems now used in many clinics provide digital images of embryos at frequent time intervals [14]. In the vast majority of cases, the output from TLT systems is still analysed by embryologists, who manually annotate morphological features, abnormal cleavage patterns that are correlated with embryo quality [6], and the duration of cell stages, thus introducing intra- and interobserver variability [17]. Some TLT systems, though, allow computer-assisted annotation, which might reduce the intra- and interobserver variability among embryologists [9], but using this feature can incur additional costs. Recently, the application of object detection algorithms in the field of medical imaging has proven to provide fast and accurate results [10,12].

**Fig. 1.** Cell stages of human embryo development.

In this study, we developed an approach to locate cells in images depicting embryonic development. The approach was developed and evaluated on TLT images, which were the frames of TLT videos. The approach was able to count the number of cells in each TLT frame and to track the detected cells and cell divisions across consecutive frames. It also identified the different cell stages. The approach employed YOLO v5 to detect the cells present in the frames and then tracked each individual cell across different cell stages by marking each cell boundary with a distinctly colored circular overlay. The distinct color scheme helped the embryologists track individual cells and their divisions and identify cell cleavages over the course of the TLT video. The average processing time of our approach was 8 s per TLT video. The methodology could also detect abnormal cleavage patterns such as direct cleavage [16] and reverse cleavage [11].

We used six performance metrics to evaluate the software's performance in detecting cell stages. The software performed best for 2-cell stage detection, and its performance decreased as the number of cells inside the embryo increased. The performance of our method was validated by embryologists, who considered the tracking of cells with colored overlays useful. The main contributions of this study are: (i) using our method, embryologists could accurately detect cells, track cell divisions and determine cell cleavage stages up to 5 cells; (ii) our approach has the potential to detect abnormal cleavage patterns in human embryo development; and (iii) the approach could generate accurate annotations of the morphokinetics related to cell cleavages and cell stages in 8 s on average for TLT videos, at an average rate of 20 frames per second.

#### **2 Methods and Materials**

#### **2.1 Data**

The dataset was collected retrospectively at Fertilitetssenteret, a fertility clinic in Oslo, Norway, and consisted of TLT videos of human embryo development. The embryos were cultured inside a time-lapse system called Embryoscope™ (Vitrolife, Denmark).

**Time-Lapse Imaging.** The introduction of TLT in ART practice enables continuous monitoring of embryos throughout their whole culture period. Embryoscope™ is an incubator equipped with a built-in microscope and a camera. For each embryo placed inside the incubator, the system took 8-bit images at several focal planes (3 or 5) every 10–15 min. Each 8-bit image has a resolution of 500 *×* 500 pixels. Using time-lapse imaging (TLI) images, embryologists get insights into the morphokinetics associated with embryo cell development without removing embryos from the incubator [7]. Later, for every TLT video, the embryologists analyzed each frame (8-bit image) and manually annotated the start of each observed cell stage. The observed cell stages were: 2-cell, 3-cell, 4-cell, 5-cell, 8-cell, 9<sup>+</sup>-cell, morula and blastocyst. In this study, we used 890 TLT videos, from which we extracted the frames corresponding to the annotated start of each cell cleavage stage. This resulted in a total of 2785 images; each cell stage had 350 images, except for blastocyst with 335 images. We denote this as Dataset I and used it to train the object detection algorithm. A second dataset, Dataset II, was also created, comprising 11 other TLT videos. We annotated this dataset for the start of observed cell stages using our methodology. Dataset II was used as an independent dataset.

**Abnormal Cleavage Patterns.** A successful fertilization between sperm and egg results in a fertilized egg, which over the next few days undergoes a series of cell divisions, progressing through the cell stages. The embryo should cleave every 12 or 24 h. Thus, by the time an embryo has reached Day 3 of development, it should have between four and eight cells [1]. Continuous monitoring of embryo morphology using TLT has revealed certain abnormal cell cleavage patterns [4]. One such pattern is reverse cleavage, defined as a decrease in the number of cells during cell division: cells in a cell stage fuse together to form one cell (reducing the cell count) and cleave again afterwards [11]. Another abnormal cleavage pattern is direct cleavage, which occurs when a cell divides directly into three or more daughter cells [16]. Such abnormal cleavages correlate with impaired embryo development and implantation potential [13,19] and should be detected.

**Ethical Consideration.** Fully anonymized data were collected after approval by the Regional Committee for Medical and Health Research Ethics - South East Norway (REC). All experiments were performed in accordance with the guidelines and regulations of REC and the General Data Protection Regulation.

#### **2.2 Object Detection**

Object detection is a fundamental task in image processing. It is a form of image classification in which the method predicts the objects in an image using bounding boxes around them. It is defined as the detection and localization of objects in an image, where the objects belong to predefined classes [2]. In recent years, owing to deep learning (DL), and especially convolutional neural networks (CNNs), object detection models have performed particularly well in the field of medical imaging [12]. The convolutional kernels in the models extract features layer by layer and obtain the probabilities of candidate bounding boxes belonging to different classes. Object detection models can be categorised as one-stage networks, such as You Only Look Once (YOLO) [15], and two-stage networks, such as Fast R-CNN [8]. A two-stage object detection model breaks object detection down into two tasks: it first detects possible object regions and then classifies the image content in those regions into predefined classes [2]. YOLO, as a one-stage network, instead uses an end-to-end neural network that processes the whole picture by dividing it into N grid cells of equal size. Each grid cell predicts the probability of object classes being present in it, along with the object label and the bounding box coordinates relative to the grid cell's coordinates. The bounding boxes are weighted by the expected probability of each object. YOLO then uses non-maximum suppression to suppress all bounding boxes with lower probability scores. YOLO uses the mean Average Precision (mAP) metric to measure detection performance when predicting bounding boxes for object classes. mAP is the mean of the Average Precision (AP) over all object classes. AP summarizes the precision-sensitivity curve of the bounding box predictions for an object class into a single value, the average of all precision values [2]. For object detection to be applied to real-time videos at a fertility clinic, the algorithm must be fast. YOLO is a much faster algorithm than its counterparts [2]. Thus, in this study, we used YOLO v5 to detect the object classes cell, morula and blastocyst in the frames of TLI videos. The annotated locations of the object classes in the training images (Dataset I) and the YOLO v5 predictions on Dataset I were reviewed by embryologists. The mAP was 0.65 for the object class cell, 0.78 for morula and 0.80 for blastocyst.
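The non-maximum suppression step mentioned above can be illustrated with a minimal, generic sketch of greedy suppression based on intersection over union (IoU). This is not the YOLO v5 implementation; the function names and the IoU threshold of 0.5 are assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: Sequence[Box], scores: Sequence[float],
                        iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better-scored kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical detections: the first two boxes overlap heavily, the third is separate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))
```

With many overlapping cells, a low IoU threshold may suppress true neighbouring cells, which is one possible reason detection becomes harder at higher cell counts.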

#### **2.3 Colored Circular Overlay Algorithm**

In this section, we explain the algorithm that adds colored circular overlays to embryo cells. Our approach first used YOLO v5 to detect the cells present in the frames of TLI videos. Once we had the bounding boxes (coordinates) of the detected cells, we used the OpenCV library to mark each cell boundary with a differently colored circular overlay. After marking the detected cells with distinct colored overlays, the methodology computed the cell count and recorded the coordinates of each cell. The color assigned to a cell was maintained until the cell divided into daughter cells. Afterwards, each daughter cell got a distinct colored overlay of its own. The methodology recognized the daughter cells as unique individual cells and kept track of them in the succeeding frames using the color of the overlay. After processing a whole TLI video, the methodology produced a new version of the input video in which every frame had colored overlays on the detected cell boundaries.
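One way to turn a detected bounding box into a circular overlay is sketched below. The choice of the box centre as circle centre, half of the longer box side as radius, and the small color palette are assumptions for illustration; the text does not specify how the circles were derived.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

# Hypothetical palette of distinct overlay colors in BGR order (as used by OpenCV).
PALETTE: List[Tuple[int, int, int]] = [
    (0, 255, 255),  # yellow
    (255, 0, 0),    # blue
    (0, 255, 0),    # green
    (255, 0, 255),  # magenta
    (0, 0, 255),    # red
]


def box_to_circle(box: Box) -> Tuple[Tuple[int, int], int]:
    """Return the integer centre and radius of a circle covering the box's longer side."""
    x_min, y_min, x_max, y_max = box
    center = (round((x_min + x_max) / 2), round((y_min + y_max) / 2))
    radius = round(max(x_max - x_min, y_max - y_min) / 2)
    return center, radius


def initial_colors(cell_count: int) -> List[Tuple[int, int, int]]:
    """Assign a color to each cell of the first frame, cycling through the palette."""
    return [PALETTE[i % len(PALETTE)] for i in range(cell_count)]


center, radius = box_to_circle((100, 120, 200, 240))
print(center, radius, initial_colors(2))
```

With OpenCV installed, each overlay could then be drawn on a frame with `cv2.circle(frame, center, radius, color, thickness)`.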

If the cell count remained the same between consecutive frames, our methodology calculated the proximity between each cell in the current frame and the cells detected in the preceding frame. The proximity was calculated as the difference between the coordinates of two cells, the first from the current frame and the second from the preceding frame. If the calculated proximity lay within a specific threshold, the methodology copied the color of the cell in the preceding frame to the matching cell in the current frame. In this way, cell tracking using colored overlays was performed. The proximity threshold used in our algorithm was 0.10 for cell counts less than 4 and 0.05 for cell counts greater than 4.

If the cell count differed between consecutive frames, our methodology checked whether the current frame had a higher cell count than the preceding frame. If so, one of the cells might have cleaved into daughter cells. The methodology identified the parent cell in the preceding frame using the same concept of proximity and assigned the parent's color to the daughter cells, recognizing the current frame as the one with the cell division. The methodology then annotated the current frame as the start of a new cell stage, namely the stage corresponding to the number of detected cells. If not, that is, if the cell count of the current frame was lower than that of the preceding frame, the methodology still calculated the proximity between cells and copied the matching colors. A lower count in the current frame could be a case of abnormal cleavage or of a few cells not being detected by YOLO v5.
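The frame-to-frame logic described in the two paragraphs above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: cells are represented by normalized centre coordinates, proximity is taken as the Euclidean distance between centres, and the threshold for exactly four cells (which the text does not specify) is set to 0.05. The function names and the `"unassigned"` marker are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # normalized (x, y) centre of a detected cell


def proximity_threshold(cell_count: int) -> float:
    # 0.10 below four cells and 0.05 above four cells, as stated in the text.
    # Exactly four cells is not covered there; using 0.05 is an assumption.
    return 0.10 if cell_count < 4 else 0.05


def propagate_colors(prev_cells: Sequence[Point], prev_colors: Sequence[str],
                     cur_cells: Sequence[Point]) -> Tuple[List[str], bool]:
    """Copy colors from the nearest cell of the preceding frame.

    Returns the colors for the current cells and a flag that is True when the
    current frame has more cells than the preceding one (possible division).
    """
    threshold = proximity_threshold(len(cur_cells))
    colors: List[str] = []
    for cell in cur_cells:
        if not prev_cells:
            colors.append("unassigned")
            continue
        distances = [math.dist(cell, prev) for prev in prev_cells]
        nearest = min(range(len(prev_cells)), key=distances.__getitem__)
        # Daughter cells of one parent both match that parent and share its color.
        colors.append(prev_colors[nearest] if distances[nearest] <= threshold
                      else "unassigned")
    division_detected = len(cur_cells) > len(prev_cells)
    return colors, division_detected


# Hypothetical 2-cell frame followed by a 3-cell frame (the yellow cell divided).
prev_cells = [(0.40, 0.50), (0.60, 0.50)]
prev_colors = ["yellow", "blue"]
cur_cells = [(0.38, 0.48), (0.43, 0.53), (0.60, 0.50)]
print(propagate_colors(prev_cells, prev_colors, cur_cells))
```

When the flag is set, the current frame would be annotated as the start of the stage matching the new cell count, and the daughter cells would receive distinct colors from the next frame on.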

#### **3 Results**

To test our methodology, we used Dataset II for cell tracking and for detecting cell cleavage stages. The methodology processed each video in the dataset and generated a corresponding video with colored circular overlays on the detected cells in every frame. The embryologists could track a cell using the color of its overlay. Starting from the first frame, our methodology assigned a distinct color to each cell, and that color was maintained until the cell divided. The daughter cells were then assigned different color overlays from the next frame on. In Fig. 2, the bottom row presents a few frames extracted from a video generated by our methodology, and the top row shows the corresponding original video frames. The frames in the bottom row have colored circular overlays marking the boundaries of the detected cells, and the same color scheme is maintained until cell division. Cell division can be seen in frames 5 and 7 of Fig. 2, followed by a distinct colored overlay for each cell in the succeeding frames 6 and 8.

**Fig. 2.** Extracted frames from a TLI video of embryo development until the 4-cell stage. The top row shows the actual video frames and the bottom row shows our method's output with a colored overlay on each cell. The single cell in frame 1 divided into 2 cells, as shown in frames 3 and 4. The yellow colored cell divided in frame 5. From frame 6, our method annotated the 3-cell stage, each cell with a distinct color. The blue colored cell started to divide in frame 7, and the 4-cell stage was annotated from frame 8. (Color figure online)

#### **3.1 Comparison with Embryologists**

Two embryologists independently validated the performance of our methodology. To this end, they verified the number of detected cells in each frame of the generated videos. They also verified that the start of each cell stage, as annotated by the methodology, either matched their annotation exactly or differed by only a few frames on average. Our methodology detected cells, tracked cell divisions and precisely annotated the start of each cell stage up to the 5-cell stage. For stages with a cell count above five, the annotated start of cleavage was 9 to 10 frames later than the actual start on average. In Fig. 3, we present some frames extracted from a video showing embryo development until the 9-cell stage. Our methodology detected cells and tracked cell divisions accurately up to the 5-cell stage, as seen in frames 1 to 8 of Fig. 3. When the cell count exceeded five, the methodology was confused by overlapping cell boundaries and either missed a cell (frame 12 of Fig. 3) or detected an incorrect location for a cell (yellow circle in frame 9 of Fig. 3).

**Fig. 3.** Extracted frames from a TLI video of embryo development until the 9-cell stage. The top row shows the actual video frames and the bottom row shows our method's output with a colored overlay on each cell. The green colored cell divided in frame 4. From frame 5, our method annotated the 3-cell stage, and from frame 6 it tracked the cell division (blue colored overlay). The 4-cell stage was annotated in frame 7. In frame 9, an incorrect cell location was detected (yellow overlay), but the correct cell count was detected in frames 10 and 11. A cell was again missed in frame 12. (Color figure online)

#### **3.2 Cell Counting Performance**

Next, we evaluated the performance of our methodology using the following six performance metrics: sensitivity (SENS), precision (PREC), specificity (SPEC), accuracy (ACC), F1-score (F1), and the Matthews correlation coefficient (MCC). Using multiple metrics provides a more reliable and robust insight into the real capabilities of our approach. We measured how well the methodology reported the correct cell count in a frame, tracked cell divisions and annotated the start of a cell cleavage stage. The results were validated by the embryologists using criteria based on the cell count, the detected cell boundaries, the choice of the correct parent for the daughter cells at cell division, and the match between our methodology's annotation and their own annotation of the start of a cell stage. The MCC is a reliable statistical rate that gives a high score only if the prediction (frame belonging to a cell stage) obtains good results in all four confusion matrix categories [3]. MCC measures the agreement between the actual label (frame annotated by an embryologist as belonging to a cell stage) and the predicted label (frame annotated by our methodology as belonging to a cell stage). An MCC value lies between *−*1 and 1. A negative value indicates disagreement between the actual and predicted labels, a value around zero indicates that the model decides randomly, and a value above zero indicates agreement, with 1 corresponding to perfect prediction. Our methodology obtained an MCC of 0.77 for predicting the start of cleavage stages up to the 5-cell stage. We observed that the overlay color sometimes changed abruptly between frames, or that the wrong parent was chosen for the daughter cells. We labelled these predictions as incorrect. To quantify the performance of our methodology, we used the performance metrics listed in Table 1. The methodology performed best for the 2-cell stage (precision = 0.91, sensitivity = 0.98, highest F1-score = 0.95). The detection of the 1-cell stage was quite accurate (precision = 0.99, sensitivity = 0.86, high F1-score = 0.91), but a few instances of the 1-cell stage were misclassified as morula. A few instances of the 4-cell stage were also misclassified as the 3-cell or 5-cell stage, but our methodology mostly detected the 4-cell stage accurately (high precision = 0.87, low sensitivity = 0.62, high F1-score = 0.73). A higher number of instances of the 3-cell and 5-cell stages were misclassified as other stages; still, the detection of these cleavage stages was better than random: 3-cell (average precision = 0.46, high sensitivity = 0.93, average F1-score = 0.61), 5-cell (high precision = 1.0, low sensitivity = 0.31, average F1-score = 0.47). For cell stages with a cell count greater than 5, we observed poor performance, as sensitivity, precision and F1-score for these stages were below 0.40. We therefore did not evaluate our methodology further for these cell stages.
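For reference, the six metrics can be computed per cell stage from one-vs-rest confusion counts, as in the minimal sketch below. The function name `stage_metrics` and the counts in the example are hypothetical; they are not taken from Table 1.

```python
import math
from typing import Dict


def stage_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute SENS, PREC, SPEC, ACC, F1 and MCC from confusion matrix counts."""
    def ratio(num: float, den: float) -> float:
        # Convention used here: a metric with an empty denominator is reported as 0.
        return num / den if den else 0.0

    sens = ratio(tp, tp + fn)
    prec = ratio(tp, tp + fp)
    spec = ratio(tn, tn + fp)
    acc = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * prec * sens, prec + sens)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"SENS": sens, "PREC": prec, "SPEC": spec,
            "ACC": acc, "F1": f1, "MCC": mcc}


# Hypothetical counts for one stage: frames of that stage are the positive class.
metrics = stage_metrics(tp=45, fp=5, tn=140, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because MCC uses all four counts, a stage with high precision but low sensitivity (as reported for the 5-cell stage) still receives a moderate score rather than a high one.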


**Table 1.** Evaluation results of the performance metrics on Dataset II for detecting embryo cell cleavage stages using our methodology

We observed a similar pattern in the receiver operating characteristic (ROC) curves for cell stages up to the 5-cell stage. As shown in Fig. 4, the area under the curve (AUC) is largest for the 2-cell stage and smallest for the 5-cell stage. Thus, our methodology performed best in detecting and tracking cell division for the 2-cell stage and worst for the 5-cell stage.

**Fig. 4.** ROC curve for the software detecting embryo cell cleavage stages on Dataset II.

#### **3.3 Computational Efficiency**

We also calculated the processing time taken by our methodology on Dataset II. The processing time covered processing a video and generating the corresponding video with colored overlays. On average, 8 s were required per video. We further divided Dataset II into two groups: (A) videos up to the 5-cell stage and (B) videos containing cell stages with a cell count greater than five. The average processing time was 4 s for group A and 19 s for group B. The average number of processed frames per second (fps) for videos in Dataset II was 20, with 8 fps for A and 33 fps for B. This is far quicker than the real-time progression of embryos, so the processing time does not pose any practical delay for embryologists using the method for embryo assessment.

#### **3.4 Anomaly Detection**

We further evaluated whether our method could detect anomalies in embryo development. In Dataset II, there were two TLI videos with instances of direct cleavage and reverse cleavage. Figure 5 shows frames from one of these videos in which our method detected anomalies. In the direct cleavage, a single cell divided into 3 cells. Reverse cleavage was observed at the 3-cell stage (2 cells fused into one and later divided again into 2 cells) and at the 4-cell stage (2 cells fused into one cell). The abnormal cleavage patterns detected by our methodology were validated by the embryologists as correct detections.

**Fig. 5.** Extracted frames from a TLI video of embryo development until the 4-cell stage. The top row shows the actual video frames and the bottom row shows our method's output with a colored overlay on each cell. The first two frames from the left show direct cleavage of a single cell into 3 cells. The next three frames show reverse cleavage from 3 cells to 2 cells and back to 3 cells. The last two frames on the right show reverse cleavage from 4 cells to 3 cells.

#### **4 Discussion**

Our method detected cells, cell divisions and cleavage stages up to 5 cells. For the single-cell (1-cell) stage, it performed with high precision, but misclassification as the morula stage was also observed. This could be attributed to the compacted structure of the morula, which closely resembles the 1-cell stage. Our approach performed best in detecting the 2-cell stage, and its performance dropped considerably when detecting cells or reporting cell stages with a cell count greater than five. The methodology detected those cell stages later than their actual cleavage because of the increased overlap between neighbouring cell boundaries. With a higher cell count, the structure of a cell stage gets more complex and cells tend to lie on top of each other, making cell counting more difficult. The methodology counted two cells as one because YOLO v5 is trained to analyse a 2-D image, and the depth information (3-D view) that would indicate a potential overlap is missing. We observed that for the 3-cell and 5-cell stages, there was high fluctuation in the reported values of performance indicators such as sensitivity and precision: the 3-cell stage had lower precision and higher sensitivity, while the 5-cell stage had lower sensitivity and higher precision. For these stages, the imbalance in the performance of our approach arose because the overlay color changed abruptly between frames.

Once a cell stage was detected using our approach, fewer cells were detected by YOLO v5 in the consecutive frames, and then the correct count was reported again. Thus, the training dataset for object detection needs to be more comprehensive. If there is some noise in the images, or some situations are not covered by the training data, the robustness of the object detection model will be reduced [12]. Our methodology was time efficient and could generate videos with colored overlays and annotated cell stages in 8 s on average for Dataset II videos, at 20 fps on average. In comparison, the camera in the time-lapse incubator captures an image of an embryo every 10–15 min. This shows that including our methodology to process TLT videos will not introduce any additional time delay and will support embryologists in decision making. Thus, our approach can be used in real time.

The methodology can help reduce the subjectivity associated with the assessment of an embryo's quality. The methodology also showed potential for detecting abnormal cleavage patterns, which can be useful for embryologists when assessing an embryo's quality and viability for transfer to the woman.

## **5 Conclusion**

Object detection proved to be pragmatic for ART. Overall, our approach successfully detected cells, effectively tracked cell divisions and accurately determined cleavage stages up to the 5-cell stage. Our approach was time efficient and can be used for real-time processing of TLI videos without introducing an additional time delay. Tracking cell division using our methodology seems to have potential for detecting abrupt cleavage patterns such as direct or reverse cleavage. Qualitative evaluation by embryologists resulted in the overall verdict that the methodology is useful and seems promising for clinical practice. We also hypothesise that using a larger dataset for training and including images from other focal planes, to provide depth information, will enable our methodology to detect overlapping cells and cell cleavage stages with a cell count greater than five.

## **References**



# **Phenotyping of Cervical Cancer Risk Groups via Generalized Low-Rank Models Using Medical Questionnaires**

Florian Becker<sup>1,2(B)</sup>, Mari Nygård<sup>3</sup>, Jan Nygård<sup>3</sup>, Age Smilde<sup>1,4</sup>, and Evrim Acar<sup>1</sup>

<sup>1</sup> Simula Metropolitan Center for Digital Engineering, Oslo, Norway florian@simula.no

<sup>2</sup> Oslo Metropolitan University, Oslo, Norway

<sup>3</sup> Cancer Registry of Norway, Oslo, Norway

<sup>4</sup> Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, Netherlands

**Abstract.** The purpose of this study is to uncover cervical cancer (CC) risk phenotypes from self-reported lifestyle questionnaires and screening data. In general, computational phenotype discovery aims to find subgroups among individuals that share distinctive characteristics by analyzing electronic health records (EHR). This can benefit the understanding of a disease as well as uncover risk factors and provide possibilities for preventive action. The features in the *women* (*n* = 6359) by *questionnaire features* (*p* = 29) matrix with missing data are of different statistical data types (e.g., binary or ordinal data). We use so-called *generalized low-rank models* (GLRM) that can address this challenge via different statistical-data-type-dependent loss functions. We show that these models can uncover phenotypes related to cervical cancer risk factors from large-scale questionnaire data.

**Keywords:** Computational phenotyping · Unsupervised learning · Low-rank approximations · Electronic health records · Cervical cancer

## **1 Introduction**

The collection and processing of electronic health records (EHR) has the potential to increase the quality of care and diagnostic value [1,2]. EHR may include, for instance, the medical history, medication, demographics, or other personal or lifestyle meta information. Questionnaires or surveys are one way to gather information about lifestyle choices that might serve as complementary information to other EHR data. Anticipating the adoption of EHR, suitable data mining methods are needed to analyze EHR data and uncover different patient subgroups or *phenotypes* [3]. By design, questionnaires typically include aspects that are assumed or known to be risk factors. The incidence of a disease can be compared by conducting hypothesis tests between different predefined groups, for instance, between smokers and non-smokers. However, in order to uncover previously unknown phenotypes of patients, unsupervised multivariate approaches are needed. Low-rank matrix factorizations, such as principal component analysis (PCA) [4,5] or nonnegative matrix factorization (NMF) [6], are promising tools to analyze multivariate data and reveal underlying patterns in an unsupervised way. These approaches have the advantage of not presuming any predefined groups. Thus, they may allow the discovery of patient phenotypes and co-factors, i.e., features that co-occur.

However, missing data and different statistical data types of feature columns are challenging problems when analyzing heterogeneous (questionnaire) data. Generalized low-rank models (GLRM) provide a promising framework that was developed recently to address these challenges [7,8]. In this context, generalization stands for the extension of losses beyond the standard quadratic loss. GLRM approximates a heterogeneous data matrix using low-rank score and loading matrices by taking into account the statistical data type of each column. We investigate this idea, and explore whether there is a benefit for computational phenotyping compared to an NMF-based model *agnostic* to data types.

#### **1.1 Cervical Cancer Screening Programme**

Since a coordinated nationwide cervical cancer screening programme was established in Norway in 1995, the incidence of the disease has been substantially reduced [9]. In addition to collecting the screening results, the Cancer Registry of Norway sent out a questionnaire to roughly 30,000 women in 2004–2005 and 2011–2012 [10,11]. It comprises questions about lifestyle choices such as drinking and smoking habits as well as questions about contraception usage, sexual activity (e.g., number of sexual partners) and previous history of sexually transmitted diseases (STDs), among others. Together with the screening results from a cytology examination, this data set can provide researchers as well as medical practitioners with valuable insights about demographics, disease progression and patient phenotypes. The complete screening history of a woman $f$ can be denoted by $\{(s_i, d_i)\}_{i=1}^{n_f}$, where $s_i$ is the age at the $i$-th screening, $d_i$ is the associated examination result, encoded by diagnosis codes (see Table 1, Appendix), and $n_f$ is the total number of screenings for $f$. The (cytological or histological) examination results range from no atypical cells to different categorizations of pre-cancers and cancers. While the screening data is a population-level data set, the questionnaire data covers only a sub-population.

#### **1.2 Uncovering Phenotypes and Co-Factors is Ongoing Research**

While it is known that the human papillomavirus (HPV) causes nearly all cervical cancer cases, the risk factors for such an infection and their interactions are still a relevant research topic. Previous studies and reviews have identified various factors that increase the risk of cervical cancer, e.g., the duration of hormonal contraception [12] or the marital status [13]. Early age at first intercourse as well as early pregnancies have been determined to be risk factors in developing countries [14]. A further study has proposed a model according to which the incidence rate of cervical cancer is proportional to the square of time since first intercourse [15]. Some factors, such as smoking, have been identified as *co-factors*, meaning that they increase the cervical cancer risk among HPV-positive women [16]. In order to reveal these statistical associations, studies typically use uni- or bivariate tests [17]. However, to uncover more complex phenotypes, multivariate approaches are needed.

In this study, we use GLRMs to analyze a large-scale medical questionnaire data set linked with screening data, and show that GLRMs are a viable method for phenotype discovery in the context of cervical cancer risk groups. We demonstrate that when GLRMs are used to analyze questionnaire data in the form of a *female participants* by *features* matrix, meaningful phenotypes showing statistically significant differences between risk-level subgroups are revealed. One phenotype, for instance, is characterized by the number of sexual partners as well as hormonal contraception usage. Some extracted phenotypes are consistent across models using different number of components. Grouping women based on a phenotype description can potentially be used in the future to personalize cervical cancer screening programs. The ultimate goal is to avoid both too infrequent screening and over-screening. While low-rank models have been used previously for phenotyping EHR data [1,2,18,19] primarily focusing on the analysis of medication, procedure and diagnosis data, the multivariate analysis of self-reported medical questionnaires to reveal phenotypes remains an under-researched and challenging problem.

This study, to the best of our knowledge, presents the first attempt to discover phenotypes from survey data that was collected within a cervical cancer screening programme, using NMF as well as a low-rank model with data-type-specific loss functions.

#### **2 Materials and Methods**

#### **2.1 Questionnaire Description and Preprocessing**

The aspects that are covered in the questionnaire can be roughly grouped into nine categories: contraception, awareness of HPV, smoking, drinking habits, sexual activity, pregnancies, previous STDs and other personal information like marital status and education. The answers to these questions have different *statistical data types*. A question of Boolean type, for example, asks whether a person smokes, while a further question asks for the age at which the person started (or stopped) smoking. In addition to this categorization of feature columns according to their statistical data type, the features can also be categorized according to their *static* or *dynamic* nature. Static features, once reported, do not change over time (e.g., whether hormonal contraception was ever used before), while dynamic features (e.g., the number of years of smoking) are time-dependent. To a certain extent, the design of the questionnaire allows the questionnaire features to be associated with screening results.

By recording the starting age of a certain habit or the onset of a certain kind of contraception use, the time since the starting age can be computed at a certain later screening time point $s_i$.

For each screening $s_i$, a subset of the questionnaire features is transformed such that the features denote durations or "time since onset". These features are also called *delta-time* features, and the prefix dt is used to denote them. Delta features allow examination results $d_1, \dots, d_{n_f}$ to be associated with questionnaire feature vectors.

Transformed features can only be computed if the starting point for a certain habit lies in the past, given a certain screening time point $s_i$. Questionnaire feature rows that do not fulfill this condition are discarded.
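The delta-time transformation can be illustrated with a minimal Python sketch. The function `delta_time_features`, the feature names and the `dt_` separator are hypothetical and only serve as an illustration; treating an onset at exactly the screening age as valid (duration zero) and leaving missing answers missing are also assumptions.

```python
from typing import Dict, Optional


def delta_time_features(onset_ages: Dict[str, Optional[float]],
                        screening_age: float) -> Optional[Dict[str, float]]:
    """Return 'time since onset' features at one screening age.

    Returns None if any reported onset lies after the screening,
    in which case the row would be discarded.
    """
    deltas: Dict[str, float] = {}
    for name, onset in onset_ages.items():
        if onset is None:
            continue  # unanswered question: the feature stays missing
        if onset > screening_age:
            return None  # onset is not in the past at this screening
        deltas["dt_" + name] = screening_age - onset
    return deltas


# Hypothetical participant screened at age 30.
row = delta_time_features({"smoking": 18.0, "hormonal_contraception": 21.0},
                          screening_age=30.0)
print(row)
```

Applying the function once per screening age $s_i$ yields one candidate feature vector per examination result $d_i$.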

To arrive at the final questionnaire data, the feature vector corresponding to the *worst* screening result (diagnosis codes in ascending order, cf. Table 1 in the Appendix) for each female participant is extracted. Rows and feature columns in the data set that contained more than 50% missing values were discarded. For example, questions about different STDs (e.g., chlamydia, gonorrhea) were answered by relatively few women. The final features that were included in the analysis are shown in Table 3, Appendix. Screening results are heavily skewed towards normals. In order to prevent a low-rank model from primarily modeling the normal group, only a randomly sampled subset of normals is used. The distribution of risk-level categories in the final *women* (n = 6359) by *questionnaire items/features* (p = 29) matrix is shown in Table 1, Appendix.
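As a rough illustration of the missing-value filter, the following sketch drops columns and then rows with more than 50% missing entries. The order of the two steps (columns first) and the use of `None` for a missing answer are assumptions; the helper names are hypothetical.

```python
from typing import List, Optional, Sequence

Row = List[Optional[float]]


def missing_fraction(values: Sequence[Optional[float]]) -> float:
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def drop_sparse(matrix: List[Row], threshold: float = 0.5) -> List[Row]:
    """Drop columns, then rows, with more than `threshold` missing entries."""
    n_cols = len(matrix[0])
    keep_cols = [j for j in range(n_cols)
                 if missing_fraction([row[j] for row in matrix]) <= threshold]
    reduced = [[row[j] for j in keep_cols] for row in matrix]
    return [row for row in reduced if missing_fraction(row) <= threshold]


# Toy matrix: the second column is answered by one participant only.
toy = [
    [1.0, None, 0.0],
    [0.0, None, 1.0],
    [None, None, 1.0],
    [None, 2.0, None],
]
filtered = drop_sparse(toy)
print(filtered)
```

In this toy example the sparsely answered column is removed first, after which the last row no longer contains any observed value and is removed as well.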

#### **2.2 Generalized Low-Rank Models**

**Notation:** Scalars are denoted as lowercase letters, vectors as boldface lowercase letters, and matrices as boldface uppercase letters. By $x_{ij}$ we denote the $(i, j)$ entry of a matrix **X**. We use $\mathbf{x}_{i:}$ to denote the $i$th row and $\mathbf{x}_{:j}$ to denote the $j$th column of an $n \times p$ matrix **X**. We treat both $\mathbf{x}_{i:}$ and $\mathbf{x}_{:j}$ as column vectors.

We use generalized low-rank models to approximate the heterogeneous survey data matrix $\mathbf{Q} \in \mathbb{R}^{n \times p}$ using a low-rank female-mode matrix $\mathbf{X} \in \mathbb{R}^{n \times k}$ and a phenotype matrix $\mathbf{Y} \in \mathbb{R}^{k \times p}$ with $k$ factors, where $k$ is often much smaller than $\min(n, p)$. In contrast to data matrix **Q**, factor matrices **X** and **Y** are real-valued. The factor matrices are computed by solving the following optimization problem:

$$\begin{aligned} \min_{\mathbf{X},\mathbf{Y}} \quad & \sum_{(i,j)\in\Omega} \mathcal{L}_j(q_{ij}, \mathbf{x}_{i:}^\top \mathbf{y}_{:j}) / \sigma_j^2 + \lambda_r \mathcal{R}_r(\mathbf{X}) + \lambda_c \mathcal{R}_c(\mathbf{Y}) \\ \text{s.t.} \quad & \mathbf{X} \ge 0,\ \mathbf{Y} \ge 0, \end{aligned} \qquad (1)$$

where $\Omega$ is the set of observed entries, $\mathcal{L}_j : (\mathbb{R} \times \mathbb{R}) \to \mathbb{R}$ denotes the entry-wise loss function that depends on the statistical data type of the respective column in **Q**, and $\mathbf{X} \ge 0$ indicates that all matrix entries are nonnegative. To balance the unequal scaling across different columns, $\sigma_j^2 = \frac{1}{n_j - 1} \sum_{i:(i,j)\in\Omega} \mathcal{L}_j(\mu_j, q_{ij})$ is introduced, where $\mu_j = \operatorname{argmin}_{\mu} \sum_{i:(i,j)\in\Omega} \mathcal{L}_j(\mu, q_{ij})$ and $n_j$ denotes the number of non-missing entries in column $j$; $\sigma_j^2$ is a generalization of the variance that depends on the loss function. This means that scaling is not a preprocessing step. Instead, in order to scale the columns, a small optimization problem needs to be solved to obtain the $\{\mu_j\}_{j=1}^{p}$, which are then used to compute the $\{\sigma_j^2\}_{j=1}^{p}$. The $\{\mu_j\}_{j=1}^{p}$ themselves are not used in the optimization problem (1), i.e., the columns are only scaled, but not centered.

$\mathcal{R}_r(\mathbf{X}) = \sum_{i=1}^{n} r_i(\mathbf{x}_{i:})$ and $\mathcal{R}_c(\mathbf{Y}) = \sum_{j=1}^{p} r_j(\mathbf{y}_{:j})$ denote regularization terms across rows and columns, denoted by the subscripts $r$ and $c$, respectively. We use the $\ell_1$-norm, i.e., $r_i(\mathbf{x}_{i:}) = \|\mathbf{x}_{i:}\|_1 = \sum_{j=1}^{k} |x_{ij}|$, to enforce sparsity across the rows of **X** and columns of **Y**. The reasons for using sparsity are two-fold: sparsity enforces clustering [20,21] and (together with nonnegativity) leads to a less arbitrary, more well-posed solution of the optimization problem above. In general, low-rank models are non-convex. Missing data exacerbate the problem of non-convexity and lead to more local minima [22]. Note that the formulation above does not incorporate a weight matrix. Instead, the set $\Omega$ contains the indices of all available data in **Q**. An equivalent formulation is to use a binary weight matrix that encodes missing and non-missing data.
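To make the structure of problem (1) concrete, the following Python sketch evaluates the objective for given factor matrices. It is not the implementation used in this study: the choice of a quadratic loss for a real-valued column and a logistic loss for a Boolean column encoded as −1/+1 is an assumption made for illustration (the losses actually used are listed in Table 2, Appendix), and all function names are hypothetical. The loss functions take the model value first and the observed value second.

```python
import math
from typing import Callable, Sequence

Loss = Callable[[float, float], float]  # loss(model_value, observed_value)


def quadratic(u: float, q: float) -> float:
    """Quadratic loss for real-valued columns."""
    return (u - q) ** 2


def logistic(u: float, q: float) -> float:
    """Logistic loss for Boolean columns encoded as -1 / +1 (an assumption)."""
    return math.log1p(math.exp(-q * u))


def loss_scale(loss: Loss, mu: float, observed: Sequence[float]) -> float:
    """sigma_j^2: summed loss at the best constant mu_j, divided by n_j - 1."""
    return sum(loss(mu, q) for q in observed) / (len(observed) - 1)


def objective(Q, X, Y, losses, sigma2, lam_r, lam_c) -> float:
    """Scaled data fit over observed entries plus l1 penalties on X and Y."""
    k = len(Y)
    fit = 0.0
    for (i, j), q in Q.items():  # Q holds only the observed entries (Omega)
        u = sum(X[i][c] * Y[c][j] for c in range(k))  # x_i:^T y_:j
        fit += losses[j](u, q) / sigma2[j]
    l1_x = sum(abs(v) for row in X for v in row)
    l1_y = sum(abs(v) for row in Y for v in row)
    return fit + lam_r * l1_x + lam_c * l1_y


# Toy data: 3 women, 2 features; entry (2, 0) is missing and absent from Q.
Q = {(0, 0): 1.0, (1, 0): 3.0, (0, 1): 1.0, (1, 1): -1.0, (2, 1): 1.0}
losses = [quadratic, logistic]
sigma2 = [
    loss_scale(quadratic, 2.0, [1.0, 3.0]),                 # mu_0: column mean
    loss_scale(logistic, math.log(2.0), [1.0, -1.0, 1.0]),  # mu_1: log-odds of +1
]
X = [[1.0], [2.0], [0.5]]  # n x k scores with k = 1
Y = [[1.0, 0.5]]           # k x p loadings
value = objective(Q, X, Y, losses, sigma2, lam_r=1.0, lam_c=1.0)
print(f"objective = {value:.4f}")
```

For the quadratic loss, $\mu_j$ is the column mean and $\sigma_j^2$ reduces to the sample variance; for the logistic loss with −1/+1 encoding, the minimizing constant is the log-odds of the positive class.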

Low-rank approximations have been extended beyond the minimization of the quadratic loss in the past, e.g., to model Poisson or Bernoulli-distributed data [23]. The framework used in this study, however, facilitates the use of different loss functions as well as imposing constraints on the factors through regularization. Constraints play a crucial role in matrix factorizations since additional constraints are often needed to reveal unique patterns (that can be further interpreted as, e.g., phenotypes, biomarkers). The framework has been used before to investigate autism spectrum disorder phenotypes using hospitalization records [7].

#### **3 Experiments**

We assess the performance of a GLRM-based model in terms of revealing phenotypes from the questionnaire data matrix **Q**. Our results demonstrate that GLRM can reveal phenotypes showing statistically significant differences between cervical cancer risk groups. We also show that both GLRM and an NMF-based model find similar *general risk factors* using a 4-component model. However, when a higher number of components is used to reveal more phenotypes, GLRM uncovers more phenotypes that are both statistically significant and consistent.

#### **3.1 Implementation Details and Experimental Set-Up**

In order to solve the optimization problem given in (1), we use the Julia package LowRankModels.jl, which fits low-rank models using an alternating proximal gradient method [8]. We extended this framework to fit our needs. For instance, we implemented a Kullback-Leibler divergence loss function $\mathcal{L}_{\mathrm{KL}}$ for count data (cf. Table 2 in the Appendix). To reduce the risk of ending in a poor local minimum, we use 50 random initializations and keep the solution with the minimum loss. We also validate the uniqueness of **X** and **Y** experimentally by assessing solutions from multiple runs, making sure (by visual inspection) that the factor matrices corresponding to the minimum function values are the same.

In this study, two types of models are used: the first is defined by the optimization problem (1) using different loss functions $\mathcal{L}_j$, and the second, a naïve counterpart, uses the same constraints and regularization but only a quadratic loss function across all feature columns. Hence, the second type is a nonnegative matrix factorization with additional $\ell_1$ regularization, considered as the naïve counterpart of the GLRM. In the following, we use the abbreviation GLRM to refer to the tailored model with statistical data-type-dependent loss functions, and NMF to refer to a nonnegative matrix factorization model with $\ell_1$ regularization. We explored different regularization parameters for the sparsity regularization, i.e., $\lambda_r, \lambda_c \in \{0.1, 1, 5, 10\}$, and observed that $\lambda_r = \lambda_c = 1$ yields sparse and significant phenotypes. Increasing the regularization parameters further yielded phenotypes that were sparser but with fewer significant subgroups.
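The kind of update performed by a proximal gradient method for the NMF variant can be sketched as follows. This is a simplified Python illustration and not the LowRankModels.jl API: it performs one step on a single row of **X** for the quadratic loss, omits the column scaling $\sigma_j^2$, and uses a fixed, hypothetical step size. The proximal operator of the $\ell_1$ penalty combined with the nonnegativity constraint is a one-sided soft threshold.

```python
from typing import Dict, List


def prox_nonneg_l1(v: List[float], threshold: float) -> List[float]:
    """Proximal operator of threshold * ||v||_1 subject to v >= 0."""
    return [max(x - threshold, 0.0) for x in v]


def update_row(x: List[float], Y: List[List[float]],
               observed: Dict[int, float], step: float, lam: float) -> List[float]:
    """One proximal gradient step on a row of X for the quadratic loss.

    `observed` maps a column index j to the observed entry q_ij;
    missing entries are simply left out.
    """
    k = len(x)
    grad = [0.0] * k
    for j, q in observed.items():
        residual = sum(x[c] * Y[c][j] for c in range(k)) - q
        for c in range(k):
            grad[c] += 2.0 * residual * Y[c][j]
    moved = [x[c] - step * grad[c] for c in range(k)]
    return prox_nonneg_l1(moved, step * lam)


# Toy example: k = 2 components, p = 2 features, the second feature is missing.
Y = [[1.0, 0.0],
     [0.0, 1.0]]
new_x = update_row([0.0, 0.0], Y, observed={0: 2.0}, step=0.25, lam=1.0)
print(new_x)
```

An alternating scheme would apply such steps to all rows of **X** with **Y** fixed, then to all columns of **Y** with **X** fixed, and repeat from each of the random initializations.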

#### **3.2 Model Selection**

One way to determine the appropriate number of components for each model is to use the imputation error. Furthermore, the imputation error allows us to compare different models [8]. For each rank-$k$ model with $k \in \{1, \dots, 16\}$, 25 different sets of held-out values are sampled. By computing the corresponding GLRM and NMF models for each of the $\{\mathbf{Q}_i^{\mathrm{miss}}\}_{i=1}^{25}$, held-out values are estimated, and reconstruction error statistics are computed. We use 15% missing values for each $\mathbf{Q}_i^{\mathrm{miss}}$. Both the median of the imputation error and its whole spread need to be taken into consideration. These statistics show the generalization performance and can be used to select a model. Refer to [8] for more information about how to compute imputation errors for mixed statistical data types.
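The hold-out procedure can be sketched in Python as follows. The sketch rests on several assumptions: the low-rank fit is replaced by a column-mean predictor as a stand-in, the error is a plain mean squared error rather than the normalized error for mixed data types described in [8], and all names are hypothetical.

```python
import random
import statistics
from typing import Callable, Dict, List, Tuple

Entries = Dict[Tuple[int, int], float]
Predictor = Callable[[int, int], float]


def split_held_out(observed: Entries, fraction: float,
                   rng: random.Random) -> Tuple[Entries, Entries]:
    """Move a random `fraction` of the observed entries into a held-out set."""
    keys = sorted(observed)
    held = set(rng.sample(keys, round(fraction * len(keys))))
    train = {ij: v for ij, v in observed.items() if ij not in held}
    test = {ij: observed[ij] for ij in held}
    return train, test


def column_mean_model(train: Entries) -> Predictor:
    """Stand-in for a fitted low-rank model: predicts the column mean."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for (_, j), v in train.items():
        sums[j] = sums.get(j, 0.0) + v
        counts[j] = counts.get(j, 0) + 1
    overall = sum(sums.values()) / sum(counts.values())
    return lambda i, j: sums[j] / counts[j] if j in counts else overall


def imputation_errors(observed: Entries, fit: Callable[[Entries], Predictor],
                      repeats: int = 25, fraction: float = 0.15,
                      seed: int = 0) -> List[float]:
    """Held-out mean squared error for `repeats` random hold-out sets."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        train, test = split_held_out(observed, fraction, rng)
        predict = fit(train)
        squared = [(predict(i, j) - v) ** 2 for (i, j), v in test.items()]
        errors.append(statistics.fmean(squared))
    return errors


# Toy data: 10 rows, 4 columns, fully observed.
observed = {(i, j): float(i + j) for i in range(10) for j in range(4)}
errors = imputation_errors(observed, column_mean_model)
print("median error:", statistics.median(errors))
```

Repeating this for every candidate rank and comparing the median and spread of `errors` corresponds to the model selection described above.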

**Fig. 1.** Imputation error statistics for $k \in \{1, \dots, 16\}$. The imputation error is mean-normalized within each feature and by the number of held-out values.

Prior to building the final models, outliers are removed via the leverage score [24] given by $\mathbf{h} = \operatorname{diag}(\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top)$, using the corresponding score matrices from the best-performing models in terms of the imputation error. Data points with a leverage score above the 99% quantile were removed (fewer than 50 subjects for both NMF and GLRM). The model selection process was repeated after the outliers were discarded.
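The leverage-based outlier removal can be sketched with the Python standard library as follows. Each score is computed as $h_i = \mathbf{x}_{i:}^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_{i:}$ by solving a small linear system; the helper names and the choice of the inclusive quantile method are assumptions.

```python
import statistics
from typing import List


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def leverage_scores(X: List[List[float]]) -> List[float]:
    """h_i = x_i^T (X^T X)^{-1} x_i for every row x_i of the score matrix."""
    k = len(X[0])
    gram = [[sum(row[a] * row[c] for row in X) for c in range(k)]
            for a in range(k)]
    return [sum(r * v for r, v in zip(row, solve(gram, row))) for row in X]


def remove_high_leverage(X: List[List[float]]) -> List[List[float]]:
    """Drop rows whose leverage exceeds the 99% quantile of all scores."""
    h = leverage_scores(X)
    cutoff = statistics.quantiles(h, n=100, method="inclusive")[98]
    return [row for row, score in zip(X, h) if score <= cutoff]


# Toy score matrix: 20 regular rows and one extreme row.
X = [[1.0, 0.1 * i] for i in range(20)] + [[1.0, 50.0]]
cleaned = remove_high_leverage(X)
print(len(X), "->", len(cleaned))
```

Since the leverage scores are the diagonal of a projection matrix, they sum to the rank of **X**, which provides a simple sanity check.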

#### **3.3 General Cervical Cancer Risk Factors**

For a first exploratory analysis, we investigate the imputation errors of GLRM and NMF in order to perform the model selection procedure described above. Figure 1 shows the imputation errors for $k \in \{1, \dots, 16\}$. NMF models outperform GLRM for $k \in \{1, \dots, 9\}$. After this range, the imputation error of NMF has high variation while GLRM is stable. In the range $k \in \{2, \dots, 9\}$, the imputation error for both NMF and GLRM does not change much. We pick $k = 4$ since both models achieve almost the smallest error for this rank. For a 4-component model, GLRM and NMF are close with respect to their imputation error, and there are some similarities in their latent features.

Figure 2 shows the corresponding score matrices $\mathbf{X}^{\mathrm{nmf}}$ and $\mathbf{X}^{\mathrm{glrm}}$, as well as the feature matrices $\mathbf{Y}^{\mathrm{nmf}}$ and $\mathbf{Y}^{\mathrm{glrm}}$ for a 4-component model. The scores are arranged according to the risk groups, which are indicated by colors (green for normal, yellow for low-grade, red for high-grade, gray for cancer). Furthermore, the higher-risk groups in the figures are over-represented (cf. Table 1, Appendix) in order to compensate for the skewness of the risk-group distribution. The horizontal line within each risk group shows the mean.

Since we are interested in whether there is a difference between diagnosis groups, especially between normals and the low-grade/high-grade risk groups, we perform unpaired $t$-tests for each risk group within each component. This means that, for instance, for the first component, we perform a $t$-test between normals vs. ASCUS (low-grade), normals vs. LSIL (low-grade), normals vs. ASC-H (high-grade), and so on. In this way, the components that capture meaningful subgroups on the basis of which different risk groups might be separated can be determined. In this study, we focus only on the components that show statistical significance in terms of group difference between normals vs. all other groups. In Fig. 2, statistical significance is indicated by using gray or blue colored bars for **Y**. Blue bars indicate that the differences between normals and all other risk groups for the corresponding component are all statistically significant, i.e., for all six $t$-tests, we found a $p$-value $\le 0.05/b_k$, where $b_k = 6k$ is a Bonferroni correction that is applied for each $k$-component model and takes into account all significance tests performed. Components that exhibit significant differences for each of the six tests will be called *significant components* in the following. Gray bars in Fig. 2 indicate that there is at least one risk group within one component with a non-significant result.
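The significance criterion can be written down as a short Python sketch. The $p$-values in the example are made up for illustration and are not results from this study; the function names are hypothetical, and the $t$-tests themselves are assumed to have been computed elsewhere.

```python
from typing import Sequence


def bonferroni_threshold(k: int, tests_per_component: int = 6,
                         alpha: float = 0.05) -> float:
    """Corrected threshold alpha / b_k with b_k = tests_per_component * k."""
    return alpha / (tests_per_component * k)


def is_significant_component(p_values: Sequence[float], k: int) -> bool:
    """True if every normal-vs-risk-group test passes the corrected threshold."""
    threshold = bonferroni_threshold(k, tests_per_component=len(p_values))
    return all(p <= threshold for p in p_values)


# Illustrative (made-up) p-values for the six tests of one component, k = 4.
example = [1e-4, 5e-4, 1e-3, 2e-3, 1e-5, 1e-6]
print(bonferroni_threshold(4))
print(is_significant_component(example, k=4))
```

For a 4-component model the threshold is $0.05/24$, so a single test with a $p$-value of, say, 0.01 is enough for a component to be shown with gray bars.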

There are phenotypes that reflect higher-risk groups. Consider, for instance, the fourth component of the GLRM model, $c_4^{\mathrm{glrm}}$: there are recognizably lower values for the normal diagnosis group (green) compared with the higher-risk groups (yellow, red, gray). The phenotype is mostly characterized by hormonal contraception usage, which is known to be a risk factor. Thus, it can be assumed that this component models a *general risk group*. This means that the latent feature space hints at risk factors. For each GLRM component, there exists one (arguably sufficiently similar) corresponding NMF component. For instance, $c_1^{\mathrm{glrm}}$ corresponds to $c_1^{\mathrm{nmf}}$, and shows a phenotype mainly defined by the features age partner and age. For GLRM, the hormonal contraception subgroup ($c_4^{\mathrm{glrm}}$) shows significance in all pairwise $t$-tests, while this is not the case for the corresponding NMF subgroup. In summary, GLRM uncovers one more significant subgroup than NMF. Perhaps surprisingly, a simple NMF model with $\ell_1$ regularization can find very similar subgroups.

**Fig. 2.** Left side of each plot shows a subsample of $\mathbf{X}^{\mathrm{nmf}}$ and $\mathbf{X}^{\mathrm{glrm}}$, respectively. The right side shows the latent features, $\mathbf{Y}^{\mathrm{nmf}}$ and $\mathbf{Y}^{\mathrm{glrm}}$. All factor matrices **X**, **Y** are normalized by the norm of their columns and rows, respectively. $c_1, \dots, c_4$ denote components. Colors indicate corresponding diagnosis groups: green: normal, yellow: low-risk, red: high-risk, gray: cancer. Blue bars for $\mathbf{Y}^{\mathrm{nmf}}$ and $\mathbf{Y}^{\mathrm{glrm}}$ indicate that the differences between normals and all other risk groups are significant, while gray bars indicate that there is at least one subgroup that is non-significant. See Table 3 for a description of the features. (Color figure online)

#### **3.4 Phenotypes for Higher Number of Components**

Increasing the number of components and inspecting the corresponding models beyond what is shown in Fig. 2 might reveal other subgroups of interest. Investigating higher ranks is necessary because there are, by design, already (at least) nine categories of questions in the questionnaire. As we described earlier, these are related to: contraception, awareness of HPV, smoking, drinking habits, sexual activity, pregnancies, previous STDs and other personal information like marital status and education. Only a model with higher rank can extract or separate these subgroups, especially as phenotypes might also be characterized by a combination of features from different categories. While the imputation error is stable for GLRM for higher ranks $k \in \{8, \dots, 16\}$, it is increasing for NMF. Several models with different numbers of components are considered in order to assess the sensitivity of the model to the number of components, and the consistency of the components interpreted as phenotypes. We inspect the models for $k \in \{7, 8, 9, 10\}$ (see Figs. 3 and 4). Figure 3 uses $\{c_1^{\mathrm{glrm}}, \dots, c_{10}^{\mathrm{glrm}}\}$ to denote the different components. Note that components from different models were grouped together based on the cosine similarity. This means that, for instance, $c_3^{\mathrm{glrm}}$ only contains components from the models with $k = 9$ and $k = 10$, while a corresponding component for $k = 8$ and $k = 7$ does not exist. Thus, $\{c_1^{\mathrm{glrm}}, \dots, c_{10}^{\mathrm{glrm}}\}$ have to be understood as a way to name different subgroups and not as an enumeration of components.
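Grouping components across models by cosine similarity can be sketched as below. The matching rule (best match above a minimum similarity of 0.9) is a hypothetical choice for illustration, since the exact grouping procedure is not specified here; the toy loadings are made up.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two loading vectors (0 if one is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match_components(Y_a: List[List[float]], Y_b: List[List[float]],
                     min_similarity: float = 0.9) -> Dict[int, Optional[int]]:
    """Map each component (row) of Y_a to its most similar row of Y_b.

    A component without a sufficiently similar counterpart maps to None.
    """
    matches: Dict[int, Optional[int]] = {}
    for a, comp_a in enumerate(Y_a):
        sims = [cosine_similarity(comp_a, comp_b) for comp_b in Y_b]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches[a] = best if sims[best] >= min_similarity else None
    return matches


# Toy loadings over three features for a 2- and a 3-component model.
Y_small = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 1.0]]
Y_large = [[0.0, 1.0, 0.9],
           [0.9, 0.1, 0.0],
           [0.0, 0.0, 1.0]]
print(match_components(Y_small, Y_large))
print(match_components(Y_large, Y_small))
```

In the toy example, the third component of the larger model has no counterpart in the smaller one, which mirrors the situation where a subgroup only appears for higher ranks.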

There are two important and general observations about the latent feature space. First, some related features are also grouped together within components. The features that are most distinct in component $c_7^{\mathrm{glrm}}$, for instance, are related to sexual habits. Second, there are many components that are consistent across models with different numbers of components. We say that two or more components from different $k$-component models are consistent with respect to some subgroup if there is a consensus between their most important feature weights. In some cases, phenotypes are characterized by very few prevalent features that are related, e.g., the hormonal contraception/condom subgroup $c_9^{\mathrm{glrm}}$. An example of a phenotype that is consistent across all four models is age partner + age ($c_4^{\mathrm{glrm}}$). We use the label *complex phenotype* to denote a subgroup that is characterized by features from more than two categories.

Besides showing the phenotypes, Figs. 3 and 4 also display the Bonferroni-adjusted statistically significant subgroups (i.e., $p \le 0.05/b_k$). Within each component, we indicate statistical significance by using either filled or unfilled bars: if every risk group (from low-risk to cancer) deviates significantly from the normal group, the corresponding bars are filled; otherwise, only the edges are shown.

Components that are consistently visible for different numbers of components, $k$, and that show statistically significant deviations between normals and every other risk group provide strong evidence for a meaningful phenotype within the questionnaire data. Important phenotypes uncovered by GLRM are, for instance, $c_9^{\mathrm{glrm}}$ (hormonal contraception, condom, number of partners) or $c_4^{\mathrm{glrm}}$ (age of first sexual partner + age). Figure 4 shows the phenotypes for NMF.

**Fig. 3.** Normalized $\mathbf{Y}^{\mathrm{glrm}}$ components $\{c_1^{\mathrm{glrm}}, \dots, c_{10}^{\mathrm{glrm}}\}$ for models with $k \in \{7, 8, 9, 10\}$. Different colored bars indicate factors from the different models. Filled bars correspond to significance between all risk levels for a certain subgroup.

**Fig. 4.** Normalized $\mathbf{Y}^{\mathrm{nmf}}$ components $\{c_1^{\mathrm{nmf}}, \dots, c_{10}^{\mathrm{nmf}}\}$ for models with $k \in \{7, 8, 9, 10\}$.

#### **4 Discussion**

For a 4-component model, NMF and GLRM both uncover phenotypes related to hormonal contraception, age + age of first sexual partner, and a complex phenotype that has a similar profile (with the exception of num partners). The subsequent analysis using higher-rank models with $k \in \{7, 8, 9, 10\}$ suggests that using loss functions that match the data type is better suited for phenotype discovery than using standard quadratic loss functions. GLRM uncovers more phenotypes compared to NMF. Furthermore, component $c_9^{\mathrm{glrm}}$ shows that GLRM is able to reveal a significant subgroup that is mainly defined by two binary variables: hormon contr and condom. Some components show that related features are grouped together within components, e.g., $c_3^{\mathrm{glrm}}$ (contraception + sexual habits) or $c_9^{\mathrm{glrm}}$ (hormonal contraception, condom, number of partners).

Grouping of related features, consistency between different k-rank models, expert knowledge and significance between risk-levels provide evidence that (generalized) low-rank models can uncover important phenotypes. By design, the questionnaire mainly contains items that are known to be important risk factors. However, the results in this study show that significant components or subgroups that are defined by multivariate features exist. A subgroup that is found by both GLRM and NMF, as well as across different k-rank models within both models is the phenotype that is characterized by the age of the female participant as well as the age of the first sexual partner.

Some phenotypes that are defined by one or a few very dominant features align with the literature on cervical cancer risk factors. The usage of hormonal contraception ($c_5^{\mathrm{glrm}}$), especially when used for long durations, is linked with an increased risk of cervical cancer [12,25]. The number of sexual partners is another well-known and important risk factor [26–28], and is for instance reflected by component $c_9^{\mathrm{glrm}}$. Component $c_3^{\mathrm{glrm}}$ and especially component $c_3^{\mathrm{nmf}}$ group the number of sexual partners and the history of genital warts together, an association that has been found previously [29]. Time since first intercourse [15] is a further contributing risk factor ($c_9^{\mathrm{glrm}}$). Our analysis suggests that investigating models with a higher number of components uncovers important features and phenotypes that are not present for lower-rank models. For example, the binary feature hpv, which stands for knowledge about HPV, only appears in $c_{10}^{\mathrm{glrm}}$ in a pronounced way.

Using the score matrices $\mathbf{X}^{\mathrm{nmf}}$ and $\mathbf{X}^{\mathrm{glrm}}$ from all previously discussed models, we tried to find clusters, e.g., by using $k$-means clustering on all possible subspaces defined by the columns of the score matrices. No distinct clusters were found that reflect the different risk levels, which is probably due to the *uniform effect:* $k$-means clusters tend to have uniform sizes and hence cannot capture imbalanced risk levels [30]. We assume that it is not possible to find distinct, non-overlapping clusters based on questionnaire data alone, as the within-risk-level variation is too large. However, our results indicate that it is possible to uncover certain tendencies of risk-level groups.

## **5 Future Work**

Validating phenotypes based on unpaired t-tests between risk-level groups is a limitation as differences in the means might constitute a necessary but not sufficient condition for the clinical meaningfulness of a phenotype. Testing the validity of phenotypes, i.e., their significance in a clinical context, is a challenge that might be adequately addressed by methods from *survival analysis* [1,31]. In survival analysis, the *time until an event ('hazard') occurs* is studied. In our context, this time span could be defined as the time between the completion of the questionnaire and a high-grade risk result. Different phenotypes can be evaluated with respect to their hazard times which in turn can serve as a proxy to evaluate clinical significance. Figure 5 depicts an exemplary pipeline that uses a low-rank model to compute (sparse) phenotypes that are then examined by survival analysis. Such a pipeline could uncover the important phenotypes and questions and could be beneficial for personalizing cervical cancer screening programs, in order to find a better balance between too infrequent screening and over-screening.

**Fig. 5.** Pipeline from a low-rank model to personalized screening.

## **6 Conclusion**

In this study, (generalized) low-rank models were used for computational phenotype discovery in questionnaires that were sent out to gather metadata within the Norwegian cervical cancer screening programme. We used two decomposition methods, one that is agnostic to different data types and one that considers the different statistical data types via appropriate loss functions. Our results indicate that the careful construction of models that were tailored to the data types was worthwhile and revealed more significant phenotypes compared to the naïve counterpart. Discovering clinically-meaningful phenotypes helps to identify risk groups that are characterized by a combination of features. Phenotypes in the Norwegian questionnaire data related to the age of the first sexual partner, hormonal contraception, number of sexual partners and contraception usage, among others, were identified.

**Acknowledgement.** This work is part of the *DeCipher* project that is funded by the Research Council of Norway.

## **Appendix**

**Table 1.** Cervical cancer risk levels (cytology). The count column indicates the number of women in the corresponding risk-level group in the final data matrix. Diagnoses AGUS and ACIS are in the same high-grade(2) risk-level group.


**Table 2.** The four different loss functions (logistic loss, quadratic loss, Kullback-Leibler divergence, ordinal hinge loss) that are used in posing the GLRM problem. For the ordinal loss function, *d* refers to the number of options for the corresponding question.


**Table 3.** Summary of included features from the questionnaire, grouped by their statistical data type. Abbreviations used within this table: w: week, m: month, CCS: cervical cancer screening. For delta-time features (dt), *t* stands for 'time since'.


## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Automatic Unsupervised Clustering of Videos of the Intracytoplasmic Sperm Injection (ICSI) Procedure**

Andrea M. Storås<sup>1,2</sup>(✉), Michael A. Riegler<sup>1</sup>, Trine B. Haugen<sup>3</sup>, Vajira Thambawita<sup>1</sup>, Steven A. Hicks<sup>1,2</sup>, Hugo L. Hammer<sup>1,2</sup>, Radhika Kakulavarapu<sup>3</sup>, Pål Halvorsen<sup>1,2</sup>, and Mette H. Stensen<sup>4</sup>

<sup>1</sup> Department of Holistic Systems, Simula Metropolitan Center for Digital Engineering, Oslo, Norway

andrea@simula.no

<sup>2</sup> Department of Computer Science, Faculty of Technology, Art and Design, OsloMet - Oslo Metropolitan University, Oslo, Norway

<sup>3</sup> Department of Life Sciences and Health, Faculty of Health Sciences, OsloMet - Oslo Metropolitan University, Oslo, Norway

<sup>4</sup> Fertilitetssenteret, Oslo, Norway

**Abstract.** The *in vitro* fertilization procedure called intracytoplasmic sperm injection can be used to help fertilize an egg by injecting a single sperm cell directly into the cytoplasm of the egg. In order to evaluate, refine and improve the method in the fertility clinic, the procedure is usually observed at the clinic. Alternatively, a video of the procedure can be examined and labeled in a time-consuming process. To reduce the time required for the assessment, we propose an unsupervised method that automatically clusters video frames of the intracytoplasmic sperm injection procedure. Deep features are extracted from the video frames and form the basis for a clustering method. The method provides meaningful clusters representing different stages of the intracytoplasmic sperm injection procedure. The clusters can lead to more efficient examinations and possible new insights that can improve clinical practice. Further on, it may also contribute to improved clinical outcomes due to increased understanding about the technical aspects and better results of the procedure. Despite promising results, the proposed method can be further improved by increasing the amount of data and exploring other types of features.

**Keywords:** Unsupervised learning *·* Clustering *·* Human reproduction *·* Medical videos *·* Computer vision

## **1 Introduction**

Infertility is defined as a disease where an individual or a couple does not succeed in becoming clinically pregnant after a period of twelve months with regular, unprotected sexual intercourse [22]. Estimates suggest that about 190 million people worldwide are affected by infertility [9]. Assisted reproductive technology (ART) is used to treat infertility, and *in vitro* fertilization has been used for more than 40 years. The procedure called intracytoplasmic sperm injection (ICSI) [16] was introduced in the beginning of the 1990s, as a treatment for male factor infertility due to poor semen quality. Using this treatment, a single sperm is injected into the egg. The use of ICSI has greatly increased over the past years [1,7].

Visual examinations of the ICSI procedure are performed to evaluate technical aspects of the procedure. Some of the critical steps during the procedure are the selection of which sperm to inject, how the immobilization of sperm is performed, the technique used for injecting the sperm into the egg and the quality of the egg. Figures 1a to 1d illustrate different stages of the procedure, as well as debris. All video frames in the figures are from the data applied in the present study. Differences in results reported after ICSI treatments are partly explained by the level of experience of the embryologist performing the procedure, but technical variations might also be important [17]. Videos of the ICSI procedure can therefore be used for training purposes and for refinement of internal procedures at the fertility clinic. Detailed understanding and control of the technical procedure may lead to improved clinical outcomes as well, such as higher fertilization and pregnancy rates. However, the examination and labeling of videos are time-consuming and require knowledge about the critical steps during the procedure. Furthermore, medical professionals with such knowledge are not always available for labeling medical data, which complicates the process of obtaining labeled videos of high quality. Consequently, unsupervised learning is an attractive alternative, as it allows for training artificial intelligence (AI) models without labeled data. Because the outputs from the unsupervised models are not assigned distinct labels, some type of human interpretation of the results is required, but this still requires less work than manually labeling all samples in a dataset.

**Fig. 1.** Examples of video frames representing sperm selection (a), sperm immobilization (b), sperm injection (c) and debris (d). The frames come from the data used in the present work.

In this work, we present an unsupervised clustering technique that is able to cluster video frames from the ICSI procedure into groups that represent different stages of the procedure. This can make the examination of the videos more effective, and the health personnel will save time as they can watch the critical steps directly. Further on, focusing on the relevant parts of the procedure might contribute to easier detection of possible improvements, which could lead to improved clinical outcomes. Unsupervised clustering techniques have been developed for summarization of capsule endoscopy videos [8], detection of anomalies in computed tomography (CT) scans [3], segmentation of 3D medical images [14] and diagnosis of coronavirus disease (COVID-19) from medical images [13]. None of these studies apply the same clustering algorithms as in the present paper, and they do not investigate data from the field of human reproduction. Regarding the use of AI to analyze videos of the ICSI procedure, one study trained a U-Net neural network to extract video frames of the oolemma, i.e., the cell membrane of the egg, during sperm injection [10]. To our knowledge, this is the first time unsupervised clustering has been applied to video frames of the ICSI procedure. Thus, the main contributions of this work are:

In the following, Sect. 2 provides an overview of the data and methods used in this work. This is followed by a description of our experiments and a presentation of our results in Sect. 3. Next, our findings and their implications on the clinical practice are discussed in Sect. 4. Finally, we provide a conclusion and possible future directions in Sect. 5.

## **2 Data and Method**

Seven videos of artificial reproduction using the ICSI procedure are used in the experiments. The videos come from a pilot study that was conducted at Fertilitetssenteret in Oslo, Norway, in 2021. Because the data is anonymized, no ethical approvals are required. The resolution is 1920 × 1080, and the frame rate is 25 frames per second for all videos. The video length ranges from 15 seconds to more than 2 minutes. The longest video includes sperm selection, immobilization and injection, while the other videos capture one or two of the stages. All videos were captured at 200× magnification with a DeltaPix camera. The ICSI procedure was performed using a Nikon ECLIPSE TE2000-S microscope connected to an Eppendorf TransferMan 4m micromanipulator. The sperm cells were immobilized in 5 µl polyvinylpyrrolidone (PVP; CooperSurgical). The clinical outcome of the procedures is not included in the analysis.

Figure 2 provides an overview of the proposed workflow for unsupervised clustering. Video frames are extracted every second from the seven videos using the OpenCV library in Python [2]. The sampling interval of one second is chosen in order to extract frames reflecting the video contents without losing much information. The extracted frames are passed through a convolutional neural network (CNN), ResNet50 [6], that has been pretrained on the ImageNet data set [20]. Features are extracted from the layer preceding the output layer, resulting in 2,048 deep features per frame. Further on, dimensionality reduction with t-SNE [11] is applied to the extracted features. By reducing the dimensions of the data to two, the distribution of the video frames can easily be plotted for visual inspection, and the proposed method becomes more transparent. Moreover, dimensionality reduction has been applied prior to clustering of video frames to speed up the analysis [8]. t-SNE is chosen because it is an efficient technique for dimensionality reduction that has shown good performance on high-dimensional data points such as images [11]. When applying t-SNE, the user must specify the perplexity hyperparameter value, which can be thought of as a measure of the effective number of neighbors for each data point. Usually, the value should lie between 5 and 50 [11]. The perplexity values of 10, 15, 20 and 30 are tested for our data. These values are chosen based on the size of our dataset. The value should be smaller than the total number of samples to avoid one large cluster. On the other hand, values that are too small cause local variations to dominate. The dimensionality reduction is evaluated by visually inspecting plots of the results and identifying the plot with the most distinct clusters. The output from t-SNE is then clustered using unsupervised clustering.

**Fig. 2.** The workflow for the proposed clustering method. First, video frames are extracted from videos of the ICSI procedure. The frames are then passed through a pretrained ResNet50 for extraction of deep features. The dimensionality is reduced using t-SNE before the frames are clustered using either X-means or G-means.

Because the optimal number of clusters is not known, X-means clustering [19] is applied to determine the appropriate number of clusters. G-means clustering [5] is also tested. Both algorithms identify the optimal number of clusters in the provided data. They are wrappers around the k-means algorithm [12], and the final clusters depend on the cluster initialization. Consequently, the results can vary between runs even though the dataset is the same. While X-means applies the Bayesian Information Criterion to find the appropriate cluster number, G-means instead tests whether the data in each cluster follow a Gaussian distribution. The G-means algorithm has shown higher performance than X-means when the clusters are non-spherical [5]. All code is written in Python. Pyclustering is applied for unsupervised clustering [15], and PyTorch [18] is used for extracting the deep features from the pretrained ResNet50 model [6]. The source code is publicly available online<sup>1</sup>.
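The dependence on initialization can be illustrated with plain k-means, the algorithm that both X-means and G-means build on. The following is a minimal sketch in pure Python, not the Pyclustering implementation used in this work. The four toy points, the two sets of starting centers and the helper names (`sq_dist`, `kmeans`, `sse`) are hypothetical and chosen only so that the two runs converge to different partitions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    centers = [tuple(float(v) for v in c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers


def sse(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


# Hypothetical toy data: the corners of a 4 x 1 rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Two different initializations of the same algorithm on the same data.
labels_a, centers_a = kmeans(points, [(0, 0), (4, 0)])
labels_b, centers_b = kmeans(points, [(2, 0), (2, 1)])

print(labels_a, sse(points, labels_a, centers_a))  # [0, 0, 1, 1] 1.0
print(labels_b, sse(points, labels_b, centers_b))  # [0, 1, 0, 1] 16.0
```

Both runs converge, but the second one stops in a poorer local optimum, which is why repeated runs on the same frames can suggest different clusterings.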

The quality of the clustering is evaluated by experienced embryologists working at Fertilitetssenteret in Oslo, Norway. The clusters are also categorized into which stage of the ICSI procedure they represent to evaluate the accuracy of the methods, but this is regarded as less important than the feedback from the embryologists.

## **3 Results**

In total, 359 images are extracted from the seven videos. The extracted deep features are reduced to two dimensions using t-SNE. Following visual inspection, the best perplexity hyperparameter value for t-SNE is 20, leading to the most distinct clusters. The results are shown in Fig. 3. Regarding the unsupervised clustering, the X-means algorithm suggested two clusters for the data when no restrictions were set. However, this is not regarded as a sufficient number of clusters due to the variation between the frames. Consequently, the algorithms are restricted to suggesting between eight and 200 clusters. These limits are chosen to get clusters representing the variation in the dataset while not creating clusters that are too small with respect to the dataset size. When the X-means algorithm is forced to generate between eight and 200 clusters, the suggested number of clusters varies a lot for the same data set, ranging between 8 and 15 clusters. This makes it challenging to determine the appropriate number of clusters to use with the X-means algorithm. On the other hand, the G-means clustering algorithm is more stable, suggesting 29 or 30 clusters. Consequently, the clusters from the G-means algorithm are further investigated and evaluated by domain experts. The 29 clusters suggested by the G-means algorithm are indicated in Fig. 4.

<sup>1</sup> https://github.com/AndreaStoraas/UnsupervisedClustering ICSI.

The video frames in all of the 29 clusters are shown to four experienced embryologists working at Fertilitetssenteret in Oslo for evaluation of the quality of the clusters and detection of potential weaknesses of the method. An overall finding is that the clusters are dependent on the colors and the presence of edges in the frames. Moreover, two of the experts, one being a senior embryologist and the other one being a clinical embryologist, manually categorize the clusters after examination of typical examples of video frames from different clusters. Based on their feedback, the clusters are categorized into three subgroups that represent different critical stages of the ICSI procedure: sperm selection, sperm immobilization, and sperm injection. Video frames from these three subgroups can be studied more closely to inspect which sperm was selected, how it was immobilized and the technique applied when injecting the sperm into the egg. A fourth subgroup is also created for video frames containing bubbles and debris, here defined as noise.

The feedback from the embryologists is the main evaluation of the method. However, the accuracy of the clustering was also investigated as a secondary measure of performance. Based on visual inspection, 82% of the frames are automatically assigned to a cluster belonging to the same category. The categories were provided by the domain experts, as described above. The sperm selection seems to be the easiest part of the ICSI procedure to recognize. Still, some frames representing sperm immobilization were clustered together with sperm selection frames. Figures 5a and 5b show examples of video frames that were clustered as sperm selection according to the cluster categories from the domain experts. Figure 5a agrees with the cluster category, while Fig. 5b disagrees. Sperm immobilization is the most difficult stage for the method to recognize, as all the clusters that include frames from this part also contain frames presenting sperm injection or sperm selection. Video frames that were clustered as sperm immobilization are provided in Figs. 5c and 5d. Figure 5c agrees with the cluster category, while Fig. 5d disagrees.
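The agreement measure described above can be sketched as the fraction of frames whose own category matches the category assigned to their cluster. The snippet below is only an illustration under stated assumptions: the function name `category_agreement` and the ten toy frames are hypothetical and are not the study data, for which the reported agreement is 82%.

```python
def category_agreement(frame_clusters, cluster_categories, frame_categories):
    """Fraction of frames whose category equals the category of their cluster.

    frame_clusters:     cluster index assigned to each frame
    cluster_categories: mapping from cluster index to its expert-given category
    frame_categories:   category of each individual frame (visual inspection)
    """
    if len(frame_clusters) != len(frame_categories):
        raise ValueError("one cluster index and one category per frame required")
    if not frame_clusters:
        raise ValueError("at least one frame is required")
    matches = sum(
        cluster_categories[cluster] == category
        for cluster, category in zip(frame_clusters, frame_categories)
    )
    return matches / len(frame_clusters)


# Hypothetical toy example with three clusters and ten frames.
cluster_categories = {0: "selection", 1: "immobilization", 2: "injection"}
frame_clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
frame_categories = [
    "selection", "selection", "immobilization",       # one mismatch in cluster 0
    "immobilization", "immobilization", "injection",  # one mismatch in cluster 1
    "injection", "injection", "injection", "injection",
]

agreement = category_agreement(frame_clusters, cluster_categories, frame_categories)
print(f"{agreement:.0%}")  # 80%
```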

**Fig. 3.** Plot of the 359 images after feature extraction with a pretrained convolutional neural network and dimensionality reduction with t-SNE. The frames are colored according to the video they belong to.

## **4 Discussion**

In this work, we show that unsupervised clustering can be applied for extracting video frames from different stages of the ICSI procedure. Despite promising results, there are some limitations to be discussed. First, the proposed technique is negatively affected by the colors and edges present in the frames. Indeed, colors and edges can vary between different dishes and droplets. To make the method more robust, features that do not rely on these properties will be explored for future experiments. To reduce the variation in colors, the frames can be converted into grayscale before they are analyzed. Further on, global features such as Tamura features [21] or fuzzy color and texture histogram [4] can be applied for less dependency on the presence of edges.
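The grayscale conversion mentioned above could look as follows. This is a sketch under assumptions: the text does not specify how the conversion would be done, so the commonly used ITU-R BT.601 luma weights (0.299, 0.587, 0.114) are assumed here, and the function name `to_grayscale` and the 2 × 2 toy frame are hypothetical.

```python
# Assumed luma weights (ITU-R BT.601); the source does not prescribe any.
WEIGHT_R, WEIGHT_G, WEIGHT_B = 0.299, 0.587, 0.114


def to_grayscale(frame):
    """Convert a frame given as rows of (R, G, B) tuples to rows of gray values."""
    return [
        [round(WEIGHT_R * r + WEIGHT_G * g + WEIGHT_B * b) for (r, g, b) in row]
        for row in frame
    ]


# Hypothetical 2 x 2 frame: white, black, pure red and pure blue pixels.
frame = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]

gray = to_grayscale(frame)
print(gray)  # [[255, 0], [76, 29]]
```

After conversion, each pixel carries a single intensity value, so differences in hue between dishes or droplets no longer enter the extracted features directly, although differences in brightness still do.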

Moreover, our data set included seven videos from the same fertility center. Consequently, it is not known how well the results generalize to larger data sets or other clinics. Since the method is sensitive to variations in colors and edges, the performance could be affected by the resolution, light and type of camera applied during the recording of the procedure. A follow-up study is planned with more videos, as well as information about the outcome, such as fertilization status, egg degeneration rate, embryo quality, embryo development, implantation and pregnancy rates.

**Fig. 4.** Results from unsupervised clustering. The 29 clusters identified by the G-means algorithm for the 359 video frames are indicated with circles.

Our results suggest that the sperm selection stage was easiest to detect with the proposed method. The sperm selection stage does not contain any needles or eggs, which might explain why this stage is more easily separated from the other stages. The stage that was most difficult to separate was the immobilization of sperm. This could be because the sperm cells are relatively small compared to the size of the injection needle, as well as the presence of noise in the frames, making it challenging to distinguish features from these frames from features representing only noise or sperm injection.

After manual inspection of the clusters, 82% of the video frames were placed in a cluster representing the same category, as defined by domain experts. Some frames were placed in clusters representing a different category, meaning that the medical experts will encounter some frames that are not appropriate for a given stage of the ICSI procedure. Nevertheless, since most of the frames in each cluster are similar, the clusters would still be useful for a more efficient examination of the ICSI procedure. With the additional experiments suggested above, the percentage of video frames disagreeing with their cluster category might also be further reduced. Further on, the labeled clusters from our experiments can potentially be used as labels in a supervised or semi-supervised learning framework in order to categorize new video frames.

Normally, the ICSI procedure is evaluated through live observation at the clinic. Alternatively, recordings of the procedure can be watched and labeled manually. According to the senior embryologist at Fertilitetssenteret, our method offers a more time-efficient way to improve training and quality assessment of the ICSI procedure. Because this potentially leads to improved results of the procedure, the clinical outcome, such as higher fertilization and pregnancy rates, might also improve. Finally, it could benefit couples suffering from infertility as well as the healthcare personnel performing the treatment.

**Fig. 5.** Examples of video frames in clusters representing sperm selection (a, b) and sperm immobilization (c, d), according to the cluster categories provided by domain experts. Frames **a** and **c** agree with their assigned cluster labels, while **b** and **d** are video frames that were placed in clusters with a different category.

## **5 Conclusion**

In this paper, we present an unsupervised method for clustering of video frames of the popular *in vitro* fertilization technique called ICSI. Deep features are extracted from the video frames before dimensionality reduction is applied. Clustering is then performed on the resulting data points. The clusters are evaluated by experienced domain experts, and the findings are discussed. The source code for the proposed method is available online.

In conclusion, our method is able to separate video frames into different stages of the ICSI procedure. This could be valuable in the fertility clinic in order to analyze ICSI videos more efficiently for training purposes, internal quality control and refinement of internal procedures. Further on, it might improve the results after treatments with ICSI, which in turn could lead to improved clinical outcomes such as higher fertilization and pregnancy rates.

For future work, we plan to experiment with features that are less affected by the change of color and the presence of edges in the video frames. We will also use a larger data set containing an increased number of videos preferably from different clinics to see if the method can be further improved.

## **References**



# **Towards New AI Methods**

# **The Kernelized Taylor Diagram**

Kristoffer Wickstrøm<sup>1</sup>(✉), J. Emmanuel Johnson<sup>2</sup>, Sigurd Løkse<sup>1</sup>, Gustau Camps-Valls<sup>2</sup>, Karl Øyvind Mikalsen<sup>1,4</sup>, Michael Kampffmeyer<sup>1,3</sup>, and Robert Jenssen<sup>1,3</sup>

<sup>1</sup> UiT the Arctic University of Norway, Tromsø, Norway

kristoffer.k.wickstrom@uit.no

<sup>2</sup> Universitat de València, València, Spain

<sup>3</sup> Norwegian Computing Center, Oslo, Norway

<sup>4</sup> University Hospital of North Norway, Tromsø, Norway

**Abstract.** This paper presents the kernelized Taylor diagram, a graphical framework for visualizing similarities between data populations. The kernelized Taylor diagram builds on the widely used Taylor diagram, which is used to visualize similarities between populations. However, the Taylor diagram has several limitations, such as not capturing non-linear relationships and being sensitive to outliers. To address such limitations, we propose the kernelized Taylor diagram. Our proposed kernelized Taylor diagram is capable of visualizing similarities between populations with minimal assumptions about the data distributions. The kernelized Taylor diagram relates the maximum mean discrepancy and the kernel mean embedding in a single diagram, a construction that, to the best of our knowledge, has not been devised prior to this work. We believe that the kernelized Taylor diagram can be a valuable tool in data visualization.

**Keywords:** Kernel methods · Taylor diagram · Data visualization

## **1 Introduction**

Clear and informative visualization of similarities between populations is a key component both in the development of methodology and in scientific publications. Depending on the particular use case, a wide range of techniques are available. One such visualization technique is the Taylor diagram (TD) [10], which was devised to relate several statistical quantities and allow for comparison of numerous data points in a single diagram. The TD has been frequently used in numerous applications, particularly in the climate sciences [6,8]. However, the statistical quantities displayed in the TD have some weaknesses that limit the usability of the diagram. For instance, one quantity in the diagram is the Pearson correlation coefficient, which only models linear relationships and can be sensitive to outliers. This curtails the TD, as many real-world applications use data that contain outliers and are connected through non-linear relationships.

One of the most well-known and widely used approaches for measuring similarity in machine learning is through kernel methods [3,4]. At its core, a kernel function corresponds to a dot product in a high-dimensional feature space, where non-linear relationships between data in the input space can become linear. As long as the kernel is positive definite, the mapping to the feature space does not have to be computed explicitly.

In this paper we propose the kernelized Taylor diagram (KTD), which is illustrated in Fig. 1. This diagram relates well-known quantities from the kernel literature [9], namely the maximum mean discrepancy (MMD) and the kernel mean embedding, in a single figure. To the best of our knowledge, such a diagram has never been devised prior to this work. The KTD makes no assumptions on the distributions of the populations and can model a rich family of relationships between populations. The functionality of the proposed diagram is demonstrated on synthetic data. Code: https://github.com/Wickstrom/KernelizedTaylorDiagram.

**Fig. 1.** KTD: The radial distance from the origin to each point is proportional to the length of the kernel mean embedding. The distance between the points is the maximum mean discrepancy.

## **2 The Kernelized Taylor Diagram**

*Taylor Diagram.* The TD was introduced as a tool that could relate several statistical quantities in a single figure [10]. Its strength lies in the ability to compare numerous data points where it would otherwise be necessary to use several figures and/or tables. The theoretical starting point of the TD is the Pearson correlation coefficient ρ and the root-mean-squared-error E between two data points. [10] argued that neither is sufficient to capture potential similarities on its own, but that together they are capable of detecting a wide range of differences between data points. Let **x** and **z** be two D-dimensional vectors representing two data points. The correlation coefficient between **x** and **z** is defined as:

$$\rho = \frac{1}{D} \sum\_{d=1}^{D} \frac{(x\_d - \bar{x})(z\_d - \bar{z})}{\sigma\_x \sigma\_z},\tag{1}$$

where x̄ and z̄ are the mean values and σ<sub>x</sub> and σ<sub>z</sub> are the standard deviations of **x** and **z**. The squared root-mean-squared-error for mean-centered data points is defined as:

$$\begin{split} E^2 &= \frac{1}{D} \sum\_{d=1}^D \left( (x\_d - \bar{x}) - (z\_d - \bar{z}) \right)^2 \\ &= \underbrace{\frac{1}{D} \sum\_{d=1}^D (x\_d - \bar{x})^2}\_{\sigma\_x^2} + \underbrace{\frac{1}{D} \sum\_{d=1}^D (z\_d - \bar{z})^2}\_{\sigma\_z^2} - 2 \underbrace{\frac{1}{D} \sum\_{d=1}^D (x\_d - \bar{x})(z\_d - \bar{z})}\_{\sigma\_{xz} = \sigma\_x \sigma\_z \rho} \\ &= \sigma\_x^2 + \sigma\_z^2 - 2\sigma\_x \sigma\_z \rho. \end{split} \tag{2}$$

The key point of the TD is to recognize the relationship between the statistical quantities in Eq. 2 and the law of cosines:

$$c^2 = a^2 + b^2 - 2ab\cos(\theta). \tag{3}$$

Here, a and b are the lengths of two sides of a triangle with angle θ between them and an opposite side of length c. The TD has seen widespread use in several domains, such as the geophysical sciences [6,8]. Nevertheless, the TD has some key weaknesses that limit its functionality in many practical applications. The Pearson correlation coefficient has a number of limitations [1]. It can only model linear relationships [2], which can be restrictive in many practical applications. Also, the Pearson correlation coefficient is known to be sensitive to outliers [1].
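The correspondence between Eq. 2 and Eq. 3 can be checked numerically. The sketch below computes the two standard deviations, the correlation coefficient and the centered squared error for two short vectors, and compares the squared error with the law-of-cosines expression in which the standard deviations are the side lengths and ρ = cos(θ). The function name `taylor_statistics` and the two toy vectors are hypothetical and serve only as an illustration.

```python
import math


def taylor_statistics(x, z):
    """Return (sigma_x, sigma_z, rho, e2) for two equally long vectors.

    Population statistics (division by D) are used, matching Eqs. 1 and 2.
    """
    d = len(x)
    mean_x = sum(x) / d
    mean_z = sum(z) / d
    dx = [v - mean_x for v in x]
    dz = [v - mean_z for v in z]
    sigma_x = math.sqrt(sum(v * v for v in dx) / d)
    sigma_z = math.sqrt(sum(v * v for v in dz) / d)
    rho = sum(a * b for a, b in zip(dx, dz)) / (d * sigma_x * sigma_z)
    e2 = sum((a - b) ** 2 for a, b in zip(dx, dz)) / d
    return sigma_x, sigma_z, rho, e2


# Hypothetical toy vectors.
x = [1.0, 2.0, 3.0, 4.0]
z = [2.0, 1.0, 4.0, 3.0]

sigma_x, sigma_z, rho, e2 = taylor_statistics(x, z)

# Law of cosines with a = sigma_x, b = sigma_z and cos(theta) = rho.
law_of_cosines = sigma_x**2 + sigma_z**2 - 2.0 * sigma_x * sigma_z * rho
theta_degrees = math.degrees(math.acos(rho))

print(abs(e2 - law_of_cosines) < 1e-12)  # True
```

In the diagram, `theta_degrees` is the angle between the two points as seen from the origin, and the square root of `e2` is the distance between them.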

*The Kernelized Taylor Diagram.* To address such limitations, we propose the KTD, which uses well-known measures from the kernel literature to model similarities between populations. The starting point of the KTD is one of the most widely used distance measures in the kernel literature, namely the maximum mean discrepancy (MMD) [7], which measures the distance between two distributions where each distribution is represented by a mean embedding of the data. Let X ∼ P and Y ∼ Q, and let *µ*<sub>x</sub> and *µ*<sub>y</sub> denote the mean embeddings representing the two distributions P and Q. Then, the MMD is defined as the norm of the difference between the two embeddings in a reproducing kernel Hilbert space H:

$$\begin{split} MMD^2 &= \|\mu\_x - \mu\_y\|\_{\mathcal{H}}^2 \\ &= \|\mu\_x\|\_{\mathcal{H}}^2 + \|\mu\_y\|\_{\mathcal{H}}^2 - 2\langle\mu\_x, \mu\_y\rangle\_{\mathcal{H}} \\ &= \|\mu\_x\|\_{\mathcal{H}}^2 + \|\mu\_y\|\_{\mathcal{H}}^2 - 2\|\mu\_x\|\_{\mathcal{H}}\|\mu\_y\|\_{\mathcal{H}} \frac{\langle\mu\_x, \mu\_y\rangle\_{\mathcal{H}}}{\|\mu\_x\|\_{\mathcal{H}}\|\mu\_y\|\_{\mathcal{H}}} \\ &= \|\mu\_x\|\_{\mathcal{H}}^2 + \|\mu\_y\|\_{\mathcal{H}}^2 - 2\|\mu\_x\|\_{\mathcal{H}}\|\mu\_y\|\_{\mathcal{H}}\cos\angle(\mu\_x, \mu\_y). \end{split} \tag{4}$$

In general, the true data distributions are not known, so the mean embeddings are replaced by empirical mean embeddings that are estimated based on samples from each distribution:

$$
\hat{\mu}\_x = \frac{1}{N} \sum\_{n=1}^N \kappa(\mathbf{x}\_n, \cdot),
\tag{5}
$$

where κ(·, ·) is a positive definite kernel that measures similarity between data points. If the kernel is characteristic [7], MMD is a metric and is zero if and only if the two distributions are equal. It was shown in [5] that the well-known Gaussian kernel with kernel width σ, G<sub>σ</sub>(**x**<sub>i</sub>, **x**<sub>j</sub>) = exp(−||**x**<sub>i</sub> − **x**<sub>j</sub>||<sup>2</sup>/(2σ<sup>2</sup>)), is a characteristic kernel. Furthermore, MMD does not assume a particular distribution of the data, and can capture both non-linear and linear relationships between distributions.
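As a minimal sketch of how Eq. 4 can be evaluated with the empirical embeddings of Eq. 5, the snippet below estimates the squared MMD from two samples with a Gaussian kernel. The function names, the kernel width of 1 and the synthetic samples are illustrative assumptions, and the estimator shown is the simple biased one obtained by averaging kernel values.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and b (width sigma)."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared MMD, term by term as in Eq. 4."""
    k_xx = gaussian_kernel(x, x, sigma).mean()  # squared norm of mu_x
    k_yy = gaussian_kernel(y, y, sigma).mean()  # squared norm of mu_y
    k_xy = gaussian_kernel(x, y, sigma).mean()  # inner product <mu_x, mu_y>
    return k_xx + k_yy - 2.0 * k_xy

# Assumed toy data: two samples from the same law and one shifted sample.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)
y_same = rng.normal(0.0, 1.0, size=500)
y_shifted = rng.normal(3.0, 1.0, size=500)
print(mmd2(x, y_same), mmd2(x, y_shifted))
```

The estimate for the shifted sample should be clearly larger than the one for the sample drawn from the same distribution.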

As with the TD, we recognize the law of cosines in Eq. 4. The norms of the mean embeddings of the two distributions are the side lengths of a triangle, ∠(*µ*<sub>x</sub>, *µ*<sub>y</sub>) is the angle between these sides, and the opposite side has length equal to the MMD between the distributions. The KTD is shown in Fig. 1.
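The geometric reading of Eq. 4 can be checked numerically. The sketch below, which assumes one-dimensional samples and a Gaussian kernel of width 1, places the reference on the horizontal axis, places a second population at radius equal to the norm of its mean embedding and at the angle between the embeddings, and compares the distance between the two plotted points with the MMD. All names are hypothetical helpers for this illustration.

```python
import numpy as np

def gaussian_gram_mean(a, b, sigma=1.0):
    """Mean of the Gaussian kernel over all pairs (a_i, b_j) of 1-D samples."""
    diff = np.asarray(a, dtype=float)[:, None] - np.asarray(b, dtype=float)[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2)).mean()

def ktd_point(ref, other, sigma=1.0):
    """Return (radius, angle) of `other` relative to `ref` in the diagram."""
    norm_ref = np.sqrt(gaussian_gram_mean(ref, ref, sigma))
    norm_other = np.sqrt(gaussian_gram_mean(other, other, sigma))
    cos_angle = gaussian_gram_mean(ref, other, sigma) / (norm_ref * norm_other)
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    return norm_other, angle

rng = np.random.default_rng(1)
ref = rng.normal(size=400)
other = 2.0 * ref + rng.normal(scale=0.1, size=400)

r_ref, _ = ktd_point(ref, ref)  # the reference lies on the horizontal axis
r_other, theta = ktd_point(ref, other)

# Cartesian positions in the diagram and the distance between them.
p_ref = np.array([r_ref, 0.0])
p_other = r_other * np.array([np.cos(theta), np.sin(theta)])
distance_in_plot = np.linalg.norm(p_ref - p_other)
print(r_other, np.degrees(theta), distance_in_plot)
```

Because the Gaussian kernel is non-negative, the cosine is non-negative as well, so all points fall in the first quadrant of the diagram.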

The norm of each mean embedding indicates the distance from the origin to the corresponding point in the KTD. For the Gaussian kernel, the kernel mean embedding captures all moments of the data population [9]. However, it is not obvious how to interpret the information that the kernel mean embeddings illustrate in the diagram. They can be related to uncertainty through the information potential (IP) from information theoretic learning [11], which allows for a similar interpretation of the KTD as of the TD. That is, the kernel mean embeddings correspond to the σ in Eq. 2. In most applications, the IP must be estimated from data. In information theoretic learning, the IP is often estimated through the quadratic IP estimator using a Gaussian kernel [11]:

$$
\hat{V}\_{2,\sigma}(X) = \frac{1}{N^2} \sum\_{i,j}^N G\_{\sigma}(\mathbf{x}\_i, \mathbf{x}\_j). \tag{6}
$$

Next, the squared norm terms in Eq. 4 can be expressed as:

$$\|\mu\_x\|\_{\mathcal{H}}^2 = \frac{1}{N^2} \sum\_{i,j}^N \kappa(\mathbf{x}\_i, \mathbf{x}\_j). \tag{7}$$

If the mean embeddings are calculated using a Gaussian kernel, Eq. 6 and Eq. 7 are equivalent. Furthermore, the IP is related to entropy as follows:

$$
\hat{H}\_2(X) = -\log(\hat{V}\_{2,\sigma}(X)).\tag{8}
$$

Entropy measures the amount of information in a random variable, but can also be interpreted as a measure of uncertainty. High entropy indicates more variation in the data, while low entropy means that the data is clustered together. From Eq. 8 it is evident that when the information potential of X is high, the entropy is low, and vice versa. For the KTD, this means that random variables with a high value for the kernel mean embedding, and thus far from the origin, are associated with low uncertainty, and conversely for a low value of the kernel mean embedding. This insight is important, as it allows us to relate concepts from the TD to the KTD.
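The relation between Eqs. 6 to 8 and the radius in the diagram can be illustrated with a short sketch. It assumes one-dimensional samples and a kernel width of 1; the two synthetic samples and the function names are made up for the example.

```python
import numpy as np

def information_potential(x, sigma=1.0):
    """Quadratic information potential estimator of Eq. 6 for 1-D samples."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2)).mean()

def quadratic_entropy(x, sigma=1.0):
    """Entropy estimate of Eq. 8: negative log of the information potential."""
    return -np.log(information_potential(x, sigma))

rng = np.random.default_rng(2)
tight = rng.normal(scale=0.2, size=500)   # clustered samples
spread = rng.normal(scale=3.0, size=500)  # widely scattered samples

for name, sample in [("tight", tight), ("spread", spread)]:
    v = information_potential(sample)
    # The square root of the IP is the norm of the mean embedding (Eq. 7),
    # i.e. the distance from the origin in the diagram.
    print(f"{name}: IP={v:.3f}, radius={np.sqrt(v):.3f}, entropy={quadratic_entropy(sample):.3f}")
```

The clustered sample should obtain the larger information potential and radius and the smaller entropy, in line with the interpretation above.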

## **3 Experiments**

To illustrate the functionality of the KTD, we consider the case where the true distribution of the data is known and generate 1000 samples from 5 different populations. The reference distribution X<sub>ref</sub> is sampled from a standard normal distribution. The remaining populations are constructed as follows:

**Fig. 2.** Comparison of TD with the KTD on the data described in Sect. 3. The experiment illustrates how the TD is not able to capture non-linear dependencies and is sensitive to outliers, when compared with the proposed KTD.

$$\begin{aligned} X\_1 &\sim 2X\_{\text{ref}} + \epsilon, \quad X\_2 \sim \frac{X\_{\text{ref}}}{2} + \epsilon, X\_3 \sim X\_{\text{ref}}^2 + \epsilon, \\ X\_4 &\sim X\_{\text{ref}} \sin(X\_{\text{ref}}) + \epsilon, X\_O \sim \frac{X\_{\text{ref}}}{2} + \epsilon \text{ (with outliers)}, \end{aligned}$$

where ε ∼ N(0, 0.01). Populations X<sub>1</sub> and X<sub>2</sub> are chosen to represent a linear relationship to the reference distribution, but with different scaling such that the standard deviation differs from that of the reference. Populations X<sub>3</sub> and X<sub>4</sub> are chosen to represent a non-linear relationship with the reference. Lastly, X<sub>O</sub> is chosen to also have a linear relationship with the reference, but with two outliers added to the population. These two outliers are samples from N(10, 1).

Figure 2a displays the TD for these populations in relation to the reference distribution, while Fig. 2b shows the KTD. First, we consider Fig. 2a. Note that X<sub>1</sub> and X<sub>2</sub> both have a high similarity with the reference, but lie at different distances from the origin as a result of the difference in standard deviation. Next, both X<sub>3</sub> and X<sub>4</sub> are indicated as having low similarity with the reference, which is expected since the relationship is non-linear. Lastly, X<sub>O</sub>, which is almost identical to X<sub>2</sub> except for two outliers, shows a much lower similarity score. This illustrates how sensitive the TD can be to outliers.
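The populations and the two quantities the TD displays for them can be sketched as follows. Several details are assumptions rather than statements from the experiment: 1000 samples are drawn per population, N(0, 0.01) is read as a variance of 0.01, fresh noise is drawn for every population, and the two outliers overwrite two existing samples so that all populations keep the same size.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

def noise():
    # Assumption: N(0, 0.01) gives the variance, so the standard deviation is 0.1.
    return rng.normal(0.0, 0.1, size=n)

x_ref = rng.normal(0.0, 1.0, size=n)
populations = {
    "X1": 2.0 * x_ref + noise(),
    "X2": x_ref / 2.0 + noise(),
    "X3": x_ref ** 2 + noise(),
    "X4": x_ref * np.sin(x_ref) + noise(),
    "XO": x_ref / 2.0 + noise(),
}
# Assumption: the two outliers overwrite existing samples, keeping n fixed.
populations["XO"][:2] = rng.normal(10.0, 1.0, size=2)

# Standard deviation and Pearson correlation with the reference.
td_stats = {
    name: (np.std(values), np.corrcoef(x_ref, values)[0, 1])
    for name, values in populations.items()
}
for name, (std, corr) in td_stats.items():
    print(f"{name}: std={std:.2f}, Pearson r={corr:.2f}")
```

Under these assumptions the correlation should be close to 1 for X1 and X2, close to 0 for X3 and X4, and noticeably reduced for XO.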

In Fig. 2b, X<sub>1</sub> and X<sub>2</sub> also show similar, high similarity scores. However, note that compared to Fig. 2a, the distance to the origin has changed, which is explained by the connection to the information potential described in Sect. 2. Next, both X<sub>3</sub> and X<sub>4</sub> are now indicated to have a high similarity with the reference, which illustrates that the KTD is capable of capturing non-linear similarities. Lastly, X<sub>2</sub> and X<sub>O</sub> are located at almost the same point in the diagram, which shows that the KTD is robust against outliers in the data.
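The outlier robustness can be probed with a small sketch that compares how much two outliers change the Pearson correlation and how much they change the cosine between the empirical mean embeddings. The kernel width of 1, the noise standard deviation of 0.1 and the choice to overwrite two samples of X2 are assumptions made for this illustration only.

```python
import numpy as np

def kernel_mean(a, b, sigma=1.0):
    """Average Gaussian kernel value over all pairs of 1-D samples."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2)).mean()

def kernel_cosine(ref, other, sigma=1.0):
    """Cosine of the angle between the empirical mean embeddings."""
    return kernel_mean(ref, other, sigma) / np.sqrt(
        kernel_mean(ref, ref, sigma) * kernel_mean(other, other, sigma)
    )

rng = np.random.default_rng(4)
n = 1000
x_ref = rng.normal(size=n)
x2 = x_ref / 2.0 + rng.normal(scale=0.1, size=n)
x_o = x2.copy()
x_o[:2] = rng.normal(10.0, 1.0, size=2)  # assumed: outliers replace two samples

pearson_gap = abs(np.corrcoef(x_ref, x2)[0, 1] - np.corrcoef(x_ref, x_o)[0, 1])
kernel_gap = abs(kernel_cosine(x_ref, x2) - kernel_cosine(x_ref, x_o))
print(f"change in Pearson r: {pearson_gap:.3f}, change in kernel cosine: {kernel_gap:.4f}")
```

Since the Gaussian kernel is bounded, two samples out of 1000 can only move the kernel averages by a small amount, whereas they contribute heavily to the variance that enters the Pearson correlation.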

## **4 Conclusion**

In this article we proposed the KTD, which relates well-known quantities from the kernel literature in a single diagram. To the best of our knowledge, such a diagram has not been devised previously. Our proposed diagram addresses some key limitations of the widely used TD, such as modeling non-linear relationships and handling outliers in the data. In future work, we intend to examine the usability of the diagram on real-world data such as in climate applications. We believe that the KTD can be a useful tool in many machine learning applications.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Simulating University Application Data for Fair Matchings**

Meirav Segal, Anne-Marie George, and Christos Dimitrakakis

Department of Informatics, University of Oslo, Oslo, Norway {meiravs,annemage,chridim}@ifi.uio.no

**Abstract.** This paper describes the design of a simulator (work in progress) that is based on Norwegian university admissions and exam data. It generates a realistic population of applicants to university programs, their preferences, and the study outcomes they would obtain if admitted to the different study programs. This simulator is a versatile tool and can be used to analyse the current admission policy for Norwegian universities in terms of many fairness criteria that, e.g., take into account student preferences, gender balance, university preferences and study outcomes. More generally, it creates a benchmark for testing matching algorithms and fairness notions without revealing sensitive data.

## **1 Introduction**

The problem of school choice, in which students are assigned to schools, is a popular research area lying at the intersection of computer science, economics and mathematics. Apart from being challenging, it has great importance due to the significant influence a school choice can have on students' future trajectories. Formally, it constitutes a matching or allocation problem under preferences in a bipartite graph. Algorithmic solutions are employed in many countries and similar areas, e.g., for university admissions in Hungary [5], allocation of teachers to positions in France [11] and patient-donor matches for kidneys in many countries [4]. These solutions often involve stable allocations based on students' and schools' preferences, capacities and other constraints enforcing formal requirements or some fairness towards subgroups. Here, stability means that no single deviation from the computed allocation is more beneficial for any party involved [6].

When designing new algorithms and methods for school choice problems, there is an obvious need to evaluate their performance in practice, preferably using real-world data. While stable allocations consider candidates' preferences, other methods might take into account future study outcomes such as dropouts or grades. Nevertheless, real-life data cannot provide outcomes for students who never participated in a study program. Thus, there is a need for a simulator that generates realistic application data and provides study outcomes for any possible allocation of students to study programs.

For example, a recent study evaluated how different policies affect dropouts in the Chilean centralized college admission system using a simulator based on real data [7]. As the simulator itself was not published, the research community cannot generate new samples to explore other questions.

We describe the planned design of a simulator based on data of the Norwegian university admission system. This data is not openly available because it contains sensitive information, but a simulator can provide reliable data for analysis while preserving privacy. This will constitute a valuable benchmark for the research community. The simulator will generate a set of applicants with demographic features (e.g. age, gender, county), educational background (e.g. high school points), their preferences over study programs and study outcomes for each of these programs. Using these attributes, decision makers can evaluate new policies. For example, the current admission system grants bonus points based on age and gender. Through simulation, we can compare students' outcomes according to assignments given by the current system, with outcomes according to assignments based on a new policy, with increased or decreased bonus points.

## **2 University Admissions in Norway**

In Norway, the admission process for most undergraduate study programs at all public academic institutions is coordinated by the Norwegian Universities and Colleges Admission Service in a centralized manner [2]. This section describes the admission process and the available data for applications and study outcomes.

#### **2.1 Admission Process**

Candidates rank 10 study programs they wish to attend. Further, university programs specify their preferences over students by a point scheme based on grades and other factors such as age, gender or military service. In addition, candidates can apply through different quotas. For example, the first-time diplomas quota is designated for candidates who have completed and passed upper secondary school in normal time and are at most 21 years old. Other quotas are intended for underrepresented groups in specific programs. All candidates who do not fit special quotas apply through the ordinary quota.<sup>1</sup> An applicant is classified as 'qualified' for a study program when they meet its minimum requirements.

In the main admission process, a specialized stable marriage algorithm is applied in order to find the candidate-optimal stable matching based on the applicants' and university programs' preferences [9]. At this point, each candidate is given at most one offer, to the highest ranked program that the candidate is qualified for (while maintaining stability).
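To make the stability requirement concrete, the following sketch lists the blocking pairs of a given assignment: a candidate and a program block the assignment if the candidate prefers the program to their current offer and the program either has a free seat or currently holds a candidate with fewer points. The data structures and the toy instance are hypothetical, and quotas are ignored.

```python
def blocking_pairs(matching, cand_prefs, prog_scores, capacity):
    """List (candidate, program) pairs that would both prefer each other.

    matching:    dict candidate -> program (or None if unmatched)
    cand_prefs:  dict candidate -> list of programs, most preferred first
    prog_scores: dict program -> dict candidate -> points (higher is better)
    capacity:    dict program -> number of seats
    """
    admitted = {p: [c for c, m in matching.items() if m == p] for p in capacity}
    pairs = []
    for cand, prefs in cand_prefs.items():
        current = matching.get(cand)
        # Programs the candidate strictly prefers to the current assignment.
        better = prefs if current is None else prefs[: prefs.index(current)]
        for prog in better:
            if cand not in prog_scores[prog]:
                continue  # not qualified for this program
            has_free_seat = len(admitted[prog]) < capacity[prog]
            weakest = min((prog_scores[prog][c] for c in admitted[prog]), default=None)
            if has_free_seat or (weakest is not None and prog_scores[prog][cand] > weakest):
                pairs.append((cand, prog))
    return pairs

# Hypothetical toy instance: two programs with one seat each.
cand_prefs = {"ann": ["cs", "math"], "bob": ["cs", "math"]}
prog_scores = {"cs": {"ann": 55.0, "bob": 60.0}, "math": {"ann": 55.0, "bob": 60.0}}
capacity = {"cs": 1, "math": 1}

stable = {"ann": "math", "bob": "cs"}
unstable = {"ann": "cs", "bob": "math"}
print(blocking_pairs(stable, cand_prefs, prog_scores, capacity))    # []
print(blocking_pairs(unstable, cand_prefs, prog_scores, capacity))  # [('bob', 'cs')]
```

An assignment is stable in this simplified sense exactly when the returned list is empty.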

After the candidates have accepted or declined the offers, study programs with remaining vacancies continue to make offers to available students over a period of one month in order of their preferences over the applicants.

<sup>1</sup> For more details of the point system and quotas see https://www.samordnaopptak.no/info/.

#### **2.2 Data**

Through the Norwegian Database for Statistics on Higher Education [1], we have been granted access to two data sets: Applications and Exams.

*Applications.* Application data<sup>2</sup> of all applications to all Norwegian university programs in the period 2017–2020. This data set includes 2,265,418 applications from ∼500,000 candidates to over 2,000 study programs at 34 academic institutions. Each year approximately 180,000 candidates apply, of whom 50% are admitted.<sup>3</sup> Every application includes the following features:


*Exams.* Exam data<sup>4</sup> of all students at Norwegian universities for all their taken exams in the period 2017–2020. The exam data includes 5,321,519 records of exams taken by students, with an average of 8 exam grades per student. For each year there are grades for approximately 30,000 courses throughout the different study programs. More specifically, we consider the following entries:


## **3 Simulator**

In this section we describe the (planned) components of the simulator individually. Figure 1a presents the process of generating a new population given the trained components. First, we generate background attributes of candidates. In addition, we generate the candidate's underlying type. This type determines the preference profile, which together with background features sets the priorities over programs. The outcome profile and outcomes over programs are determined similarly, but also affected by the preferences. Before the release of the complete simulator, we will incorporate differential privacy throughout the pipeline.

<sup>2</sup> https://dbh.hkdir.no/dbh-old/dokumentasjon/tabell.action?tabellId=379.

<sup>3</sup> Note that about 30% of the applications are of local admission, which means that the acceptance offers are made by each institution individually and not as part of the centralised process. Local admission is performed for master's programs or for special programs in which admission is based on additional criteria such as interviews.

<sup>4</sup> https://dbh.hkdir.no/dbh-old/dokumentasjon/tabell.action?tabellId=472.

**Fig. 1.** (a) Simulator pipeline diagram. (b) A possible analysis to perform on the simulated data. The priority of admission offers made according to gender, using original data with 0.01-differential privacy using the Laplace mechanism.
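For readers unfamiliar with the Laplace mechanism mentioned in the caption, a minimal sketch is given here. It assumes that the released quantities are counts and that one candidate changes a single count by at most 1 (sensitivity 1); the counts themselves are invented for the example and are not taken from the data.

```python
import numpy as np

def laplace_mechanism(counts, epsilon, sensitivity=1.0, rng=None):
    """Release counts with epsilon-differential privacy via Laplace noise."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon  # noise scale b = sensitivity / epsilon
    counts = np.asarray(counts, dtype=float)
    return counts + rng.laplace(loc=0.0, scale=scale, size=counts.shape)

# Hypothetical counts of offers by priority (1st, 2nd, 3rd choice) for one group.
true_counts = np.array([41_000, 12_500, 6_200])
noisy_counts = laplace_mechanism(true_counts, epsilon=0.01, rng=np.random.default_rng(5))
print(noisy_counts.round())
```

With ε = 0.01 and sensitivity 1 the noise scale is 100, which is small relative to counts in the thousands but would dominate counts for small subgroups.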

#### **3.1 Preprocessing and Training**

We provide details of how the selected models are trained from the bottom up:

*RankFM.* We train rankFM<sup>5</sup>, a factorisation machine model designed for ranked data with a loss function based on pairwise comparisons [10], to predict candidates' preferences over study programs. This model considers implicit data: a pairwise comparison is performed between programs ranked by the candidate and programs not ranked by that candidate, such that the latter are considered to have a lower priority. The comparison between ranked programs is not performed explicitly and is only addressed by giving larger confidence weights to higher ranked programs. Notably, rankFM allows us to incorporate candidates' features and study programs' features, such that their relation to candidates' preferences over programs is not lost. This model provides latent representations for candidates and study programs that, when combined, give a preference value for every student and university program pair.
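The implicit pairwise comparisons described above can be pictured with the following sketch, which builds training triples for one candidate: each ranked program is paired with sampled unranked programs and carries a confidence weight that is larger for higher ranks. This is an illustration of the idea only, not rankFM's internal sampling; the weighting rule 1/(1 + position) and the function name are hypothetical.

```python
import numpy as np

def pairwise_training_triples(ranked, n_programs, n_negatives, rng):
    """Build (preferred, other, weight) triples for one candidate.

    ranked: program indices listed by the candidate, most preferred first.
    Every listed program is paired only with sampled unlisted programs, so
    listed programs are never compared with each other directly.
    """
    unranked = np.setdiff1d(np.arange(n_programs), ranked)
    triples = []
    for position, program in enumerate(ranked):
        # Hypothetical weighting: confidence decays with the rank position.
        weight = 1.0 / (1.0 + position)
        for other in rng.choice(unranked, size=n_negatives, replace=False):
            triples.append((int(program), int(other), weight))
    return triples

rng = np.random.default_rng(6)
triples = pairwise_training_triples(ranked=[7, 2, 5], n_programs=10, n_negatives=3, rng=rng)
for preferred, other, weight in triples[:4]:
    print(f"program {preferred} > program {other} (weight {weight:.2f})")
```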

*FastFM.* We train fastFM [3], a factorisation machine model with root-mean-square error for explicit feedback, to predict the students' study outcomes. Here, the candidates' features and the study programs' features include the latent representation obtained from rankFM. The features also include the preferences. The outcomes may be defined as the average first-year grade, normalised to [0, 1]. New latent representations of candidates and programs are provided by fastFM.
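As background, the generic second-order factorisation machine score combines a bias, linear weights and pairwise interactions weighted by inner products of latent factors. The sketch below evaluates this score in two ways, with the usual linear-time reformulation and with the explicit double sum. It does not use the fastFM API; all parameter values are random stand-ins.

```python
import numpy as np

def fm_predict(x, w0, w, v):
    """Second-order factorisation machine score for one feature vector x.

    w0: global bias, w: linear weights (n,), v: latent factors (n, k).
    Uses the O(n k) reformulation of the pairwise interaction term.
    """
    linear = w0 + w @ x
    vx = v.T @ x  # (k,) vector with entries sum_i v[i, f] * x[i]
    interactions = 0.5 * np.sum(vx ** 2 - (v ** 2).T @ (x ** 2))
    return linear + interactions

def fm_predict_naive(x, w0, w, v):
    """Same score computed with the explicit double sum over feature pairs."""
    score = w0 + w @ x
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            score += (v[i] @ v[j]) * x[i] * x[j]
    return score

rng = np.random.default_rng(7)
n_features, n_factors = 6, 3
x = rng.normal(size=n_features)
w0, w, v = 0.1, rng.normal(size=n_features), rng.normal(size=(n_features, n_factors))
print(fm_predict(x, w0, w, v), fm_predict_naive(x, w0, w, v))
```

The rows of `v` are the latent representations: when `x` one-hot encodes a candidate and a program, the interaction term reduces to the inner product of their two latent vectors.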

<sup>5</sup> https://github.com/etlundquist/rankfm.

*Gaussian Mixture Model (GMM).* A Gaussian Mixture Model is fitted to the concatenated latent representations of the candidates. This model allows us to sample new latent representations given a Gaussian identifier.
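Sampling a latent representation given a Gaussian identifier amounts to drawing from the selected mixture component. A minimal sketch follows, with hypothetical component parameters standing in for a fitted model.

```python
import numpy as np

def sample_latent(cluster_id, means, covariances, rng):
    """Draw one latent vector from the Gaussian selected by cluster_id."""
    return rng.multivariate_normal(means[cluster_id], covariances[cluster_id])

# Hypothetical fitted parameters for a two-component mixture in 3 dimensions.
means = np.array([[0.0, 0.0, 0.0], [4.0, -2.0, 1.0]])
covariances = np.array([np.eye(3) * 0.5, np.diag([1.0, 0.2, 0.3])])

rng = np.random.default_rng(8)
draws = np.array([sample_latent(1, means, covariances, rng) for _ in range(2000)])
print(draws.mean(axis=0))  # close to means[1]
```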

*Conditional Tabular GAN (CTGAN).* CTGAN<sup>6</sup> [12] is a deep-learning-based synthetic data generator for tabular data that can learn from real data and generate synthetic clones with high fidelity. The CTGAN generator is trained using the candidates' feature data, including a GMM cluster identifier, which allows us to generate candidate populations with similar distributions of features.

#### **3.2 Generating Student Features, Preferences and Outcomes**

To generate a new population, we can now follow Fig. 1a from top to bottom. We generate individual features for a new population of a given size using CTGAN. These features include demographic attributes such as gender and citizenship, but also the GMM cluster identifier. CTGAN is designed to generate new samples based on the training data distribution, so we expect the generated candidates to have a cluster identifier that fits their other features. Then, for each generated candidate, we sample from the pretrained Gaussian selected by their GMM cluster identifier. As a result, we get latent representations that hold information regarding the preferences and outcomes of candidates. Using the precalculated latent representations for study programs, we can predict the ranking and outcome of the study programs for each generated candidate.
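The last step, turning latent representations into a ranking, can be sketched as below. The sketch keeps only the inner product between candidate and program vectors, whereas a full factorisation machine score would also include bias and feature terms; the random vectors are stand-ins for the pretrained program representations and a sampled candidate representation.

```python
import numpy as np

def rank_programs(candidate_latent, program_latents, top_k=10):
    """Order programs by the inner product with the candidate's latent vector."""
    scores = program_latents @ candidate_latent
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

rng = np.random.default_rng(9)
n_programs, dim = 50, 8
program_latents = rng.normal(size=(n_programs, dim))  # stand-in for pretrained vectors
candidate_latent = rng.normal(size=dim)               # stand-in for a GMM sample

ranking, scores = rank_programs(candidate_latent, program_latents)
print(ranking)
```

The default of ten entries mirrors the number of programs a candidate ranks in the admission process.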

#### **3.3 Simulating Admission Decisions (Work in Progress)**

Given the preferences of candidates and study programs, we can run the Gale-Shapley matching algorithm, a variation of Stable Marriage Matching for the hospitals-residents problem [8], to simulate the current admission system in Norway. The output will simulate the offers made in the first admission round. Given an initial offer, the candidate may decide to decline the offer. To simulate the second phase of acceptance, we simulate offers to applicants for programs in order of the programs' preferences (point scheme). We will use a classifier to predict offer acceptance by students for both first and second phase study offers.
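A compact, candidate-proposing deferred acceptance procedure with program capacities is sketched here to illustrate the first admission round. It is a textbook variant under simplifying assumptions (no quotas, strict point scores, hypothetical toy data) and not the specialised algorithm used by the admission service.

```python
def deferred_acceptance(cand_prefs, prog_scores, capacity):
    """Candidate-proposing deferred acceptance with program capacities.

    cand_prefs:  dict candidate -> list of programs, most preferred first
    prog_scores: dict program -> dict candidate -> points (higher is better);
                 candidates missing from a program's dict are not qualified
    capacity:    dict program -> number of seats
    Returns a dict candidate -> program (or None if left unmatched).
    """
    next_choice = {c: 0 for c in cand_prefs}
    held = {p: [] for p in capacity}  # tentative admits per program
    free = list(cand_prefs)
    while free:
        cand = free.pop()
        prefs = cand_prefs[cand]
        if next_choice[cand] >= len(prefs):
            continue  # list exhausted, candidate stays unmatched
        prog = prefs[next_choice[cand]]
        next_choice[cand] += 1
        if cand not in prog_scores[prog]:
            free.append(cand)  # not qualified, try the next program
            continue
        held[prog].append(cand)
        if len(held[prog]) > capacity[prog]:
            # Reject the lowest-scoring tentative admit.
            rejected = min(held[prog], key=lambda c: prog_scores[prog][c])
            held[prog].remove(rejected)
            free.append(rejected)
    matching = {c: None for c in cand_prefs}
    for prog, admitted in held.items():
        for cand in admitted:
            matching[cand] = prog
    return matching

# Hypothetical toy instance.
cand_prefs = {"ann": ["cs", "math"], "bob": ["cs", "math"], "eva": ["cs"]}
prog_scores = {
    "cs": {"ann": 55.0, "bob": 60.0, "eva": 58.0},
    "math": {"ann": 55.0, "bob": 60.0},
}
capacity = {"cs": 1, "math": 1}
result = deferred_acceptance(cand_prefs, prog_scores, capacity)
print(result)
```

In the toy instance the single seat in "cs" goes to the highest-scoring applicant, "ann" falls back to "math", and "eva", who listed only "cs", receives no offer in this round.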

Note that for this simulation the programs' point schemes as well as their capacities have to be known. Neither is provided in the data, but both can be deduced from the properties of the admission procedure in Norway. If a candidate has been accepted by a program (independent of whether they accept the offer and in which phase the offer was made), then


<sup>6</sup> https://github.com/sdv-dev/CTGAN.

By these observations, we gain pairwise comparisons between (qualified) candidates' point scores for the different programs. We can then find program point schemes that are linear functions or polynomials over the candidate features that satisfy these relations. The capacities are either determined by (b) or can simply be assumed to be the number of students that accepted the study offer.
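One simple way to search for a linear point scheme that is consistent with a set of pairwise comparisons is a perceptron run on feature differences. This is a swapped-in illustration, not necessarily the procedure that will be used in the simulator; the features, the hidden scheme and the margin used to create the comparisons are all hypothetical.

```python
import numpy as np

def fit_linear_point_scheme(features, pairs, max_updates=100_000):
    """Find weights w with w @ features[i] > w @ features[j] for every (i, j).

    Perceptron updates on difference vectors; terminates if a linear scheme
    consistent with all comparisons (with some margin) exists.
    """
    diffs = np.array([features[i] - features[j] for i, j in pairs])
    w = np.zeros(features.shape[1])
    for _ in range(max_updates):
        violated = np.flatnonzero(diffs @ w <= 0)
        if violated.size == 0:
            return w
        w += diffs[violated[0]]
    raise RuntimeError("no consistent linear scheme found within max_updates")

# Hypothetical data: 40 candidates, 3 features, and a hidden scheme that is
# used only to create the observed comparisons.
rng = np.random.default_rng(10)
features = rng.uniform(size=(40, 3))
hidden = np.array([1.0, 0.5, 2.0])
points = features @ hidden
pairs = [(i, j) for i in range(40) for j in range(40) if points[i] - points[j] > 0.2]

w = fit_linear_point_scheme(features, pairs)
print(w / np.linalg.norm(w), hidden / np.linalg.norm(hidden))
```

The recovered weights are only determined up to the comparisons provided, so they need not coincide with the hidden scheme even after normalisation.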

## **4 Fair Matchings**

The simulator, if implemented as described in Sect. 3, can be used to generate realistic instances of hospital/residents or school choice problems on which algorithmic solutions can be tested.

Fairness is particularly relevant to centralised school choice mechanisms and can be analysed for different solutions. We do not propose here a specific measure of fairness, but rather facilitate the analysis of different fairness notions. Apart from the usual notion of stability which only relies on preferences of candidates and programs, one can consider more elaborate objectives, such as equal preference satisfaction across groups based on gender or other demographic attributes. For example, Fig. 1b shows the satisfaction difference between men and women for the current admission system (real data). We can see that the percent of women who are offered admission to their first priority is higher than the equivalent percent of men. Yet, it is reversed for lower priorities. A possible explanation would be that women place 'safer' choices as their top priorities. Additional analysis could include satisfaction differences among counties or age groups, admission differences and outcome differences.
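An analysis of this kind reduces to comparing, per group, the share of offers made at each priority. A minimal sketch with invented offers (not taken from the data) is:

```python
from collections import Counter

def offer_priority_shares(offers, max_priority=3):
    """Share of each group's offers that went to priority 1, 2, ...

    offers: list of (group, priority) for candidates who received an offer.
    """
    totals = Counter(group for group, _ in offers)
    counts = Counter(offers)
    return {
        group: [counts[(group, p)] / totals[group] for p in range(1, max_priority + 1)]
        for group in totals
    }

# Hypothetical offers: (gender, priority of the offered program).
offers = [("F", 1), ("F", 1), ("F", 1), ("F", 2),
          ("M", 1), ("M", 1), ("M", 2), ("M", 3)]
shares = offer_priority_shares(offers)
gap_first_priority = shares["F"][0] - shares["M"][0]
print(shares, gap_first_priority)
```

The same function can be applied to other groupings, such as county or age group, by changing the first element of each pair.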

Furthermore, the ability to predict study outcomes opens up the possibility of finding allocations that offer equal predicted study success across groups. As the point scoring system of university programs is intended to rank the candidates by their capability of studying, it would be interesting to consider how much the point scheme correlates with the predicted study success of the students. One can measure how different a matching would be if university program preferences were based on predicted study success instead of point schemes.

## **5 Conclusion**

The simulator presented here is planned to use a combination of factorisation machines and Gaussian mixture models to provide a real-world-based benchmark on a countrywide scale. Using this simulated data, one could measure welfare and fairness not only with respect to students' and universities' preferences, but also with respect to their outcomes. We believe this simulator has the potential to advance the research efforts in school choice and illuminate new interesting problems that exist in current school assignment systems.

**Acknowledgements.** This work was supported by the Research Council of Norway under project number 302203. We are thankful for the data provided by the Norwegian Directorate for Higher Education and Skills.

## **References**


